INTRODUCTION


Research conducted under this contract has concerned context effects
in differential judgments. The principal method employed has been the
Up-and-Down Method. A starting premise was that this method would have distinct
advantages over the traditional method of constant stimuli for the study of
any kind of judgment bias, and hence for the study of context effects in
particular. Because this work represented an initial application of the
up-and-down method to the study of differential discrimination, it was
desirable that we devote some attention to the method as such. Our report
therefore falls into two main parts. The first of these discusses the
up-and-down method, various ways in which we employed it, our specific attempts
to improve its usefulness in discrimination studies, and our comparative
tests of this method and the method of constant stimuli. The second part
of the report discusses our experimental results on the context problem:
our studies of context effects in the multiple standards situation, and our
examination of the effects which individual trials or sequences of previous
trials have upon a current discrimination.

1. THE DIFFERENTIAL JUDGMENT SITUATION.

Our basic test situation has been one in which the subject is required
to compare two stimuli and to indicate whether the second is heavier or
lighter than the first, louder or less loud, longer or shorter, nearer or
farther away. In all cases, the subject has been limited to two opposed
judgments, as just illustrated. Judgments of "equal" were never allowed.

Throughout the discussions which follow, we refer to the function which
relates the probability of either of these alternative responses to the
stimulus dimension under study as the "psychometric function" or the
"probability of response curve". We designate the first stimulus in the pair as
the "standard" and the second as the "comparison". We use the terms "point
of objective equality" (POE) and "point of subjective equality" (PSE) in the
conventional way, POE referring to that comparison stimulus which is
objectively equal to the standard, and PSE referring to that comparison stimulus
which the subject judges to be the equal of the standard. Judgment bias is
given by the quantity (PSE - POE).

2. CONTEXT, AND CONTEXT EFFECTS.

Every judgment or response to particular stimuli is made, we shall say,
in  the  context  of  preceding  and  concurrent  stimuli  and  in  the  context  of 
preceding  and  concurrent  responses.  This  is  to  assert,  for  the  typical 
laboratory,  psychophysical  experiment,  that  the  judgment  on  any  individual 
trial  is  made  within  the  context  of  previous  trials  and  the  testing 
situation. 

We  are  going  to  use  the  term  "context  effect"  throughout  this  report  to 
apply  to  any  bias  of  a  present  judgment  which  is  a  function  of  previous 
trials. 

Other authors have at times used expressions such as "series effect"
(e.g. Woodworth and Schlosberg, 1954), "central tendency effect" (e.g.
Hollingworth, 1910), "negatively directed constant error function" (Koester
and Schoenfeld, 1946), or "response dependencies" (e.g. Senders and Sowards,
1952), to refer to portions, aspects or features of what we here call
"context effect". Our preference for "context effect" stems from our
interest in having a general term to apply to any and all effects of prior trials
on a current judgment. We thus bring together and include under this term
effects which have not necessarily been considered together in the
literature. We include effects which are session-long or long-range, in that
they are associated with the entire collection of prior trials, as well as
effects which are very short-range in that they are associated with the
immediately preceding trial or at most a few preceding trials.

3.  STIMULUS  CONTEXT  AND  JUDGMENT  CONTEXT. 

We find it convenient in some discussions to distinguish between
stimulus context on the one hand, and judgment or response context on the
other. Stimulus context consists of the stimuli presented over some
prescribed set of prior trials, whatever set happens to be of interest in the
study. Judgment context consists similarly of those responses made over
some set of prior trials of interest in the study. Although stimulus
context and judgment context typically cannot be manipulated independently of
each other, the experimenter may so design his experiment that he controls
one or the other specifically, or in some special cases controls them
jointly.

The method of constant stimuli, and any other psychophysical method
which employs a pre-arranged stimulus series, permits the direct control of
stimulus context. Session-long stimulus context is controlled and can be
manipulated in terms of the session-long distribution of comparison stimuli.
In most experiments by the method of constant stimuli, this distribution of
comparison stimuli is rectangular and centered about the POE, but the effect
of altering this distribution, i.e. of altering session-long stimulus
context, has been investigated (e.g. Harris, 1948). Short-term stimulus
context, on the other hand, may be varied and studied by organizing the
pre-arranged stimulus series in such a way that particular comparison stimuli
are used on trials preceding those of interest (see Fernberger, 1920;
Koester and Schoenfeld, 1946).

Experiments dealing with short-term judgment context have made a
relatively recent appearance in the literature, typically under titles
referring to the non-independence of successive responses (e.g. Verplanck
et al., 1952). Problems concerned with long-term or session-long judgment
context, on the other hand, have attracted little if any interest. Perhaps
this will change, however, now that we have a formal way of controlling
session-long judgment context through the up-and-down method. The sequential
nature of this method makes it possible to regulate or control the
session-wide distribution of judgments which the subject makes (in a two-alternatives
response situation). Normal application of the method assures that, in the
long run, the subject will use an equal number of "greater than" and "less
than" judgments. Modifications of the method provide situations in which
the ratio of the number of "greater than" judgments to "less than" judgments
will be 2:1, 3:1, etc.

In the foregoing terms, the method of constant stimuli is a method in
which session-wide stimulus context is controlled by the experimenter, and
the judgment distribution or judgment context is a function of the subject's
responses to those stimuli. The up-and-down method, on the other hand, is
a method in which session-wide judgment context is controlled in terms of
the sequential program which the experimenter adopts for the study, and the
stimulus distribution or stimulus context is a function of the subject's
discriminative behavior. To compare the method of constant stimuli and the
up-and-down method, then, is to compare the effect of controlling
stimulus context on the one hand and of controlling judgment context on the
other.

4. ORDER OF TOPICS AND EXPERIMENTS IN THIS REPORT.

The work of this project extended from July 1957 to August 1961.
Related work using the up-and-down method had been undertaken in our
laboratory as early as 1952. The report which follows includes a discussion
of some of these "pre-project" studies, when they are relevant, as well
as the project research.

The report is organized not chronologically, but by topics or problems.
This means that early, exploratory studies with smaller groups of subjects
are  occasionally  interspersed  with  later,  more  thorough  studies.  It  means 
also  that  our  entire  discussion  of  the  up-and-down  method  precedes  the 
research  portion  of  the  report,  even  though  not  all  of  our  context  studies 
benefited  from  planning  based  upon  all  that  we  now  know  about  the  method. 

We  believe,  however,  that  the  chosen  organization  will  be  found  the  more 
useful. 

PART I: THE UP-AND-DOWN METHOD


1.  ORIGINS  AND  GENERAL  DESCRIPTION  OF  THE  UP-AND-DOWN  METHOD 

The Up-and-Down Method was developed by non-psychologists, but early
discussions of the use of the procedure recognized its potential application
to a wide range of measurement problems, including those of psychophysics
(Anon., 1944; McCarthy, 1947). The up-and-down method is a general
measurement method, and like other measurement methods with which the psychologist
is familiar, it is both a routine for collecting observations and a numerical
or statistical procedure for deriving desired measurements from the data.

In  this  particular  case,  the  method  of  data  collection  was  introduced  by 
members  of  the  Explosives  Research  Laboratory  at  Bruceton,  Pennsylvania, 
while  the  associated  statistical  procedures  were  developed  by  the  Statistical 
Research Group of Princeton University (Anon., 1944; Anderson, McCarthy and
Tukey, 1946). The unique feature of the method is the special sequential
character of its data collecting operation. A consequent and highly
desirable property of the method is its efficiency: relatively few trials or
observations  are  required  per  measurement. 

The  up-and-down  method  applies  in  situations  where  two  alternative 
responses,  X  and  Y,  are  possible.  These  alternative  responses  may  be  of  the 
sort  "detection"  vs.  "non-detection",  "second  weight  lighter  than  the  first" 
vs. "second weight heavier than the first", etc. We arrange suitable
stimulus  conditions  for  the  first  trial  and  conduct  that  trial.  If  response 
X  occurs,  then  according  to  the  up-and-down  method,  our  next  trial  is  run 
under  stimulus  conditions  which,  by  one  step  on  our  stimulus  scale,  are 
more  favorable  to  the  occurrence  of  response  Y.  If,  on  the  other  hand, 
response  Y  occurs,  our  next  trial  is  run  under  stimulus  conditions  which, 
by  one  step  on  the  stimulus  scale,  are  more  favorable  to  the  occurrence  of 
response  X.  Throughout  a  complete  series  of  trials,  each  successive  trial 
is  scheduled  on  the  basis  of  these  same  rules.  This  programming  of  trials 
is said to be "sequential" because the stimulus conditions used on each new
trial  are  contingent  on  the  outcome  of  the  previous  trial. 

The up-and-down routine just outlined presumes that the probability
of response X increases monotonically from 0 to 1.00 (and that the
probability of response Y decreases from 1.00 to 0), as successive steps are taken
along the stimulus scale. When the psychometric function is of this
character, then the up-and-down method assures that the series of trials
will see-saw up and down the stimulus scale, and that most of the trials
will be conducted in the vicinity of that stimulus level where the
probability of response X and the probability of response Y are equal. Very
few trials are conducted at extreme stimulus levels, because whenever the
trial series moves to such a level, the response which is there highly
probable drives the series back to less extreme levels.
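
The see-saw behavior described above can be illustrated with a short simulation. This is a minimal sketch and not the procedure used in the studies reported here: the function name `simulate_up_and_down`, the cumulative-normal response function (50% point at 0, standard deviation 1), the starting level, the step size and the random seed are all assumptions chosen for illustration.

```python
import random
from statistics import NormalDist


def simulate_up_and_down(n_trials, start, step, p_x, rng):
    """Run one up-and-down series and return (record, next_level).

    p_x(level) is the probability of response X at a stimulus level and is
    assumed to rise monotonically with level.  Response X moves the next
    trial one step down (toward Y); response Y moves it one step up.
    Position is counted in whole steps from `start` to avoid rounding drift.
    """
    k = 0          # position in steps relative to the starting level
    record = []    # (level, response) for every trial
    for _ in range(n_trials):
        level = start + k * step
        response = "X" if rng.random() < p_x(level) else "Y"
        record.append((level, response))
        k += -1 if response == "X" else 1
    return record, start + k * step


# Hypothetical psychometric function: cumulative normal, 50% point at 0.
p_x = NormalDist(mu=0.0, sigma=1.0).cdf
record, next_level = simulate_up_and_down(
    200, start=3.0, step=0.5, p_x=p_x, rng=random.Random(1)
)
n_x = sum(1 for _, r in record if r == "X")
n_y = len(record) - n_x
print(n_x, n_y, next_level)
```

Because every X response moves the series down one step and every Y response moves it up one step, the difference between the X and Y counts always equals the net number of steps between the starting level and the level the next trial would use. The two counts can therefore differ only by the distance the series has drifted from its start.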

When  the  series  of  observations  has  been  terminated,  application  of 
the  numerical  procedures  developed  by  the  Princeton  Research  Group  provides 
an estimate of the stimulus level which marks the 50% point, or point of
transition  from  response  X  to  response  Y.  Further  calculations  lead  to 
an  estimate  of  the  variability  of  behavior  in  the  region  of  this  transition. 
These  estimates  are  based  on  two  assumptions:  that  the  psychometric  function 
is  a  cumulative  normal,  and  that  the  experimenter  has  chosen  equally  spaced 
stimulus  levels  on  the  scale  which  provides  that  normality. 
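
The estimation formulae themselves are not reproduced at this point in the report. As a rough illustration only, the sketch below assumes the form in which the estimate of the 50% point is commonly stated (the Dixon-Mood form): m = y0 + d(A/N ± 1/2), computed from the less frequent of the two responses, where y0 is the lowest level at which that response occurred, d is the step size, N is the number of such responses, and A is the sum of i·n_i over levels counted in steps above y0. The function name, the record format and the sample record are hypothetical, and the routines actually used in this project may differ in detail.

```python
from collections import Counter


def dixon_mood_midpoint(record, step):
    """Estimate the 50% point from a list of (level, response) pairs.

    Responses are "L" (louder; the next trial is one step lower) or
    "S" (softer; the next trial is one step higher).  Only the less
    frequent response is used, as in the assumed Dixon-Mood form.
    """
    n_louder = sum(1 for _, r in record if r == "L")
    n_softer = len(record) - n_louder
    use = "L" if n_louder <= n_softer else "S"
    counts = Counter(level for level, r in record if r == use)
    lowest = min(counts)
    n = sum(counts.values())
    # A = sum of i * n_i, with i counted in steps above the lowest level used
    a = sum(round((level - lowest) / step) * c for level, c in counts.items())
    # "louder" reports lie half a step above the midpoint, "softer" half a step below
    half = -0.5 if use == "L" else 0.5
    return lowest + step * (a / n + half)


# Hypothetical 12-trial record (comparison level in db., report).
record = [(94, "L"), (92, "L"), (90, "L"), (88, "S"), (90, "L"), (88, "S"),
          (90, "L"), (88, "L"), (86, "S"), (88, "S"), (90, "L"), (88, "S")]
estimate = dixon_mood_midpoint(record, step=2)
print(estimate)
```

In this made-up record "softer" is the less frequent report (5 of 12 trials), occurring once at 86 db. and four times at 88 db., so the sketch works from those five trials alone.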

It will have been apparent from this description that the up-and-down
method uses discrete stimulus values. The stimuli on any given trial are
fixed and unchanging during that trial. On this basis the method may be
classed as one of the "constant stimulus" methods. It may also be
classified as one of the "frequency methods" in that measurements obtained by
the method are computed from the observed frequency of occurrence of the
alternative responses under each stimulus condition. By the Princeton
Research Group it was identified as a member of the class of "staircase
methods", i.e. methods of a sequential sort which include, among others,
the method of limits (see Anderson et al., 1946).


2.  SOME  SIMILAR  PROCEDURES. 

It was not long after the development of the Bruceton-Princeton
procedures that an up-and-down scheme for stimulus control in a psychophysical
situation was independently devised and introduced by Bekesy (1947) and by
Oldfield (1949).

Bekesy incorporated the scheme in his "new audiometer". In this
audiometer, the intensity of the signal for which the patient was listening was
increased by 2 db. steps every .86 seconds as long as the patient kept a
response key depressed indicating that he heard nothing. The signal
intensity was decreased by similar steps during all periods when the key
was released indicating that the patient could hear the signal. All up and
down shifts of signal intensity from levels just above threshold to levels
just below threshold were recorded graphically. Then as the audiometer
signal was caused to sweep the auditory frequency range from one end to the
other, the recorder traced out a saw-toothed but easily interpreted picture
of the patient's audiogram. Because of its great convenience, this
audiometer has found widespread use in clinical and in research work. Newer
forms of it permit the use of rates of intensity change up to 5 db. per
second in steps of .25 db.

Oldfield's concern was with measurements of the absolute visual
threshold. He devised a continuous, motor-driven intensity control which altered
the intensity of the visual stimulus at the rate of one-half log unit per
second. The subject kept a response button depressed as long as the
stimulus was visible, thereby causing the motor to decrease the stimulus
intensity. When the stimulus was no longer visible, the subject released
the button. This reversed the motor which then gradually returned the
stimulus intensity to higher levels. Again, as in Bekesy's audiometer,
the subject's response kept the stimulus in the vicinity of the threshold.

Thus  Bekesy  and  Oldfield  hit  upon  methods  of  data  collection  which 
were  conceptually  similar  to  the  testing  procedure  of  the  Bruceton  group. 
Stimulus intensity at any moment was a function of the subject's response
to  the  stimulus  intensity  which  had  prevailed  a  moment  before,  and  the 
stimulus  control  "hunted"  in  the  vicinity  of  that  stimulus  level  where 
the  probability  of  detection  was  .50. 

Three rather important differences in method and objectives are to be
noted,  however,  between  the  work  at  Bruceton  and  that  in  the  Bekesy  and 
Oldfield  laboratories: 

(1) The Bruceton procedure involved a series of discrete trials,
whereas  the  task  for  the  observers  in  the  Bekesy  and  Oldfield  situations  was 
a  continuous  observing  task. 

(2) The object of the Bruceton procedure was to get a reasonable
estimate of the parameters of the probability of response function at
minimum cost, that is within a minimum number of trials and within a very
limited number of up-and-down crossings of the 50% point. This required a
formal, numerical analysis of the data. For Bekesy and Oldfield, on the
other hand, the rate of change of stimulus intensity was sufficiently rapid
and the intended observation period sufficiently long that the stimulus
series would clearly cross the 50% point a great many times. With many
stimulus reversals to mark the threshold, no elaborate or specific method
of threshold computation was necessary. In fact, for many purposes the
graphic record has often been sufficient in itself.

(3) The interest of the Bruceton group was in measurement with
regard to some fixed probability of response function, while Bekesy and
Oldfield were concerned with extending their observations long enough in
time to observe changes in the 50% point as a function of some variable:
sound frequency in Bekesy's case, observing time in Oldfield's case. It is
for this reason that the Bekesy-Oldfield procedure is frequently referred
to as a "tracking method". It permits one to follow or track changes in
the value of a sensory parameter. Of interest is the fact that an historically
earlier use of the same basic procedure in the area of medicine also had a
tracking emphasis: the tracking of blood pressure changes under various
conditions of activity on the part of the subject or patient (see Lange,
1943). By contrast, the computational procedures of the up-and-down
method assume that all observations have been drawn from the same unchanging
population, that is, that one set of response probabilities applies for the
entire record being analyzed.

Psychological research over the past 15 years has seen increasingly
frequent adaptations of the Bekesy-Oldfield type of procedure. Both
stepwise (e.g. Gourevitch et al., 1960) and continuous (e.g. Blough, 1956, 1957;
Evans, 1961) stimulus control have been used. Unequal stepping in the two
directions has been employed (e.g. Koh and Teitelbaum, 1961) as well as the
more usual equal stepping. When it has been desirable to quantify the
graphic records, suitable methods for computing thresholds have been devised
by individual authors (e.g. Blough, 1957; Loeb and Dickson, 1961; Koh and
Teitelbaum, 1961). Clearly the Bekesy-Oldfield technique is already well
established as a most efficient way to collect data in a wide variety of
test situations, the not too cogent concerns of Brown and Cane (1959)
notwithstanding.

Under some of the foregoing variations, the Bekesy-Oldfield technique
merges with the up-and-down method at the level of the programming of trials,
but we shall maintain distinctions here by identifying the up-and-down method
with its specific computational routines. So considered, we recognize that
the up-and-down method has been used and examined relatively little by
psychologists. Probably the primary reason for this is the fact that until
recently (Cornsweet, 1962) the only generally available discussion of the
method to appear is that given in the statistics text by Dixon and Massey
(1951, 1957).

Let  us  turn  then  to  an  account  of  some  of  the  important  features  and 
properties  of  the  up-and-down  method. 

3.  ADVANTAGES  SEEN  FOR  THE  UP-AND-DOWN  METHOD. 

Among  the  advantages  which  are  seen  for  the  up-and-down  method  over  other 
psychophysical  methods  of  the  "constant"  group,  we  may  list  the  following. 

(1) Flexibility and simplified experimental planning: the up-and-down
method assures that the series of test trials will itself "hunt" for
that stimulus level which marks the transition from response X to response
Y, so that pilot studies to locate this level approximately are unnecessary.

(2)  Efficiency:  measurements  of  any  given  reliability  should  be 
obtained  in  fewer  trials  by  the  up-and-down  method  than  by  other  methods 
(see Brownlee et al., 1953).

(3)  Opportunity  to  determine  several  limens  concurrently:  because 
of  the  need  for  fewer  trials  per  measurement,  several  concurrent  measures 
may  be  taken  in  the  same  experimental  session  through  the  use  of  separate, 
concurrent up-and-down series (see especially Part II of this report).

(4) Simplified automatic programming: successive trials for a
single up-and-down series (or for each of several concurrent series) may be
programmed through the use of an add-and-subtract stepper (see Appendix).

One further advantage which may be anticipated, but which requires
demonstration, is that the up-and-down method should be relatively free from the
kind of bias which arises in the method of constant stimuli from
experimenter-determined stimulus context (see page 3 above). It was indeed this
particular bias question which we first explored and which led us into our
general study of context effects employing the up-and-down method.

Because  all  of  our  studies  have  dealt  with  some  aspect  of  the  problem 
of  bias  in  differential  judgment,  the  following  discussion  of  details  of 
the  up-and-down  method  is  concerned  with  the  manner  in  which  the  method  may 
be  adapted  most  effectively  to  the  study  of  differential  discrimination. 
Extensions  of  the  discussion  to  other  applications  should  be  obvious  and 
therefore,  with  only  a  few  exceptions,  will  be  left  without  comment. 

4. ILLUSTRATIVE APPLICATION OF THE UP-AND-DOWN METHOD IN THE COLLECTION OF
DATA ON DIFFERENTIAL JUDGMENT.

Let us suppose a test situation in which a subject hears two tones in
temporal succession. He must report whether the second tone, the comparison,
is louder or softer than the first tone, the standard. For successive trials
in the up-and-down series, the standard is always the same in intensity,
say 90 db. The comparison, however, may be at any one of a number of
pre-arranged intensity levels. One of these levels is chosen for the first
trial, say 94 db. If the subject reports the comparison on that trial to
be louder than the standard, the experimenter conducts the second trial with a
comparison which is one step, say 2 db., below that used on the first trial.
If the subject reports the first comparison to be softer or less loud than
the standard, the comparison used on the second trial is one step more
intense than that used on the first trial. In a similar way the comparison
level for every subsequent trial is dependent upon the subject's report on
the just previous trial.

A series of 25 differential judgment trials conducted according to
this routine is shown in Figure 1. Successive trials are numbered from
left to right across the figure. Intensity levels of the comparison
stimulus are shown at the left. The standard was 90 db. On the first trial,
the comparison stimulus was 94 db. and it was judged louder than the
standard, so an "L" is shown for trial 1 opposite 94 db. The second trial,
using a comparison of 92 db., also led to a judgment of louder, so the third
trial was conducted using a comparison of 90 db. The
first judgment of softer came when 88 db. was used as the comparison, so for
the fifth trial the comparison was raised back to 90 db. The remaining
trials in the series continued in this same manner and involved comparison
intensities which ranged between 84 and 90 db.
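
The opening trials of this series can be replayed mechanically from the stepping rule. The sketch below is illustrative only: the function name `comparison_levels` and the spelled-out report labels are assumptions, and only the first four reports (louder, louder, louder, softer), which are implied by the account above, are used.

```python
def comparison_levels(first_level_db, step_db, reports):
    """Return the comparison level used on each trial, including the trial
    that would follow the last report.

    A report of "louder" lowers the next comparison by one step; a report
    of "softer" raises it by one step.
    """
    levels = [first_level_db]
    for report in reports:
        if report == "louder":
            levels.append(levels[-1] - step_db)
        elif report == "softer":
            levels.append(levels[-1] + step_db)
        else:
            raise ValueError(f"unexpected report: {report!r}")
    return levels


# First comparison 94 db., 2 db. steps, reports on trials 1 to 4.
levels = comparison_levels(94, 2, ["louder", "louder", "louder", "softer"])
print(levels)  # [94, 92, 90, 88, 90]
```

The last entry is the level for the fifth trial, which returns to 90 db. after the first "softer" report at 88 db.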

Clearly over any series, such as that shown in the figure, the comparison
stimulus intensity, being contingent upon the subject's responses, will
shift up and down irregularly, but will never wander too far away from that
comparison level which appears equal to the standard in loudness. By the
very nature of the sequential character of the stimulus series, trials are
concentrated at those stimulus levels where the subject is shifting from
judgments of "louder" to judgments of "softer". For the present data, this
point of shift is at about 87 db.

Figure 1: Illustrative record showing application of the up-and-down method in a loudness
discrimination situation. The standard for the series is a tone of 90 db. Comparison tones available
differ in steps of 2 db. The record shows the outcome of each trial over a series of 25 trials. L
means that the subject reported the comparison to be louder than the standard, S that he reported it
to be softer.

To the right in the figure, three summary frequency distributions are
shown. The first indicates the frequency with which trials were conducted
at each of the comparison stimulus levels. The second indicates the
frequency distribution of trials on which the subject reported "louder".
These are the reports which drove the up-and-down series down for subsequent
trials. In the third column is shown the frequency distribution of the
trials on which the subject reported "softer". These reports drove the
series up for subsequent trials. Note that the frequency distributions
in the last two columns are very much alike: in fact they are only different
to the extent that the series does not end with a response which would put
the next trial back at the very same level where the series started. How
similar these distributions are depends in part on whether the experimenter
happens to start the series at a level which is close to the subject's point
of subjective equality (PSE). In the present case it is clear that the
series started above the PSE, so there were more louder judgments in the
full series than there were softer judgments.

As it turns out, a feature of this method which it is important to
recognize is that the smaller the size of stimulus step used in the
comparison series, the closer the trials are concentrated in the vicinity of
the 50-50 point. In order to observe this, it is first necessary for us to
examine the manner in which the expected distribution of trials may be
calculated from a known psychometric function.

5.  THE  EXPECTED  DISTRIBUTION  OF  UP-AND-DOWN  TRIALS  AS  DERIVED  FROM  RESPONSE 
PROBABILITIES  AT  EACH  TEST  LEVEL. 

When trials are conducted at successive levels up and down the stimulus
scale as a function of the subject's responses, the frequency distribution
of trials over those levels is a direct function of the subject's
probability of judging "louder" and "less loud" at each level. Consider the
following example:

An experimenter chooses four stimulus levels at which the
probabilities that the subject will respond "louder" and "softer" are those
given in Table 1. For every 100 trials conducted at stimulus level 3,
the expected number of "louder" judgments is 90 and of "softer" judgments
is 10, as shown in columns 4 and 5 of the table. The 10 judgments of
"softer" each precede, under the up-and-down routine, trials which are
conducted at level 4. But there the probability of a judgment of "louder" is
1.00, and so these 10 trials are all followed by trials which are back
again at stimulus level 3. Clearly if 100 trials are to be conducted
altogether at level 3, and 10 are preceded by trials at level 4, the other
90 must be preceded by "softer" responses at level 2. Hence the expected
number of "softer" responses at level 2 is 90. These 90 responses,
however, constitute .75 of the expected total number of trials at level
number 2. So the latter must be 120. In other words, the expected number
of "louder" judgments at level 2 would be 30. Each of these 30 "louder"
judgments at level 2 leads the experimenter to conduct a new trial at
level 1. All of these 30 trials at level 1 must involve "softer"
judgments, because at this level the probability of "softer" is given as
1.00.


Table 1

COMPUTATION OF THE EXPECTED DISTRIBUTION OF TRIALS IN AN UP-AND-DOWN SERIES.

            Response probabilities        Expected response frequencies   Expected frequency
                                          given 100 trials at level #3    distribution of
Stimulus    for "louder"   for "softer"   "louder"       "softer"         trials: sum of two
Level       response       response       responses      responses        previous cols.

#4          1.00           .00            10             0                10
#3          .90            .10            90             10               100
#2          .25            .75            30             90               120
#1          .00            1.00           0              30               30

(For discussion, see text)

Thus out of a total of 260 trials, conducted at the levels shown
and with the response probabilities indicated, the number of trials expected
at each of the stimulus levels, 4, 3, 2, and 1, is respectively 10, 100,
120 and 30.

This serially developed calculation may be applied to any set of
probabilities and to cases with any number of stimulus levels. Hence, it is
a simple matter to determine the expected distribution of trials for any
up-and-down test, given the psychometric function and the intended test levels.
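
The serial calculation of Table 1 can be sketched in a few lines of Python.
The function name `expected_trial_counts`, its arguments, and the device of
fixing 100 trials at a reference level are illustrative assumptions and not
part of the original computations; the sketch only balances the upward moves
out of each level against the downward moves out of the level above it.

```python
def expected_trial_counts(p_louder, ref_index, ref_trials=100.0):
    """Expected number of trials at each level of an up-and-down series.

    p_louder[i] is the probability of a "louder" judgment at level i, with
    levels ordered from lowest to highest. A "louder" judgment moves the
    next trial one level down; a "softer" judgment moves it one level up.
    The count at ref_index is fixed at ref_trials (an arbitrary scaling).
    """
    counts = [0.0] * len(p_louder)
    counts[ref_index] = ref_trials

    # Upward: "softer" judgments at level i lead to trials at level i + 1,
    # all of which must return through "louder" judgments made there.
    for i in range(ref_index, len(p_louder) - 1):
        moves_up = counts[i] * (1.0 - p_louder[i])
        counts[i + 1] = moves_up / p_louder[i + 1] if moves_up > 0 else 0.0

    # Downward: "louder" judgments at level i lead to trials at level i - 1,
    # all of which must return through "softer" judgments made there.
    for i in range(ref_index, 0, -1):
        moves_down = counts[i] * p_louder[i]
        p_softer_below = 1.0 - p_louder[i - 1]
        counts[i - 1] = moves_down / p_softer_below if moves_down > 0 else 0.0

    return counts


# Response probabilities of Table 1, listed for levels #1 to #4.
table_1 = expected_trial_counts([0.00, 0.25, 0.90, 1.00], ref_index=2)
print([round(c) for c in table_1])
```

With the probabilities of Table 1 this gives 30, 120, 100, and 10 trials at
levels #1 to #4, the values shown in the last column of the table.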

To those familiar with other developments in measurement theory, it
will come as no surprise that the model used in the development of
estimation formulae for the up-and-down method specifies that the
probability of response curve be a cumulative normal. It further specifies
that the stimulus levels used for testing be equally spaced on the stimulus
scale, whatever that scale be which happens to provide normality of the
cumulative function. Let us therefore apply the foregoing computational
procedure to the case where these conditions of normality and equal step
size are met, and observe the effect which size of step has upon the
expected distribution of up-and-down trials.

6.  CHANGES  IN  THE  EXPECTED  DISTRIBUTION  OF  UP-AND-DOWN  TRIALS  AS  A  FUNCTION 
OF  CHANGES  IN  STIMULUS  STEP  SIZE,  UNDER  THE  NORMAL  MODEL. 

Expected distributions of trials are presented in Figure 2 for five
different sizes of stimulus step. These distributions are based upon
computations by Kappauf and Drucker. The step sizes range from 0.5σ to
3.0σ, where σ is the standard deviation of the cumulative normal.

[Figure 2 (plots not reproduced): one row for each value of stimulus step,
0.5σ, 1.0σ, 1.5σ, 2.0σ, and 3.0σ. Left panel: distribution of up-and-down
trials when μ coincides with a test level. Right panel: distribution of
up-and-down trials when μ is midway between two test levels. Horizontal
axis of each panel: stimulus levels, marked from -4σ to +4σ about μ.]

Figure 2: Expected distributions of up-and-down trials under the
normal model, for different sizes of stimulus step and for different
locations of μ.

Two extreme forms of the expected trial distribution are of interest:
one which applies when the mean-median, μ, of the probability of response
curve happens to fall exactly at a test level, and the other which applies
when the mean-median happens to fall midway between two test levels.
(Note: because the mean and median of the cumulative normal are identical,
we designate that common value here as the mean-median). Distributions for
both of these cases are presented for each of the five sizes of stimulus
step. Distributions for other locations of the mean-median would clearly
fall between those shown in the figure.

a. Details of computation. In the computations which provided the
present distributions, response probabilities at each desired stimulus
level were read from the table of the cumulative normal to three decimal
places. Values of p in excess of .999 were rounded to 1.000. In the course
of plotting the distributions for Figure 2, all probabilities smaller than
.005 were dropped. Each distribution is plotted on a common baseline scaled
in units of the standard deviation of the cumulative normal.

b. Findings. As we look from the top of the figure to the bottom, we
see immediately that when step size is small, i.e. 0.5σ, the expected
trial distribution is such that essentially all trials are conducted within
1.5σ of the mean-median. As step size grows larger, the expected trial
distribution has greater variability until finally when the step is 3.0σ,
trials are conducted with reasonable frequency at stimulus levels
3.0σ and more from the mean-median.

A second fact which emerges from the figure is that when step size is
small, the expected trial distribution makes use of some 6 or 7 stimulus
levels, whereas when step size is as large as 3σ the mean-median is easily
bracketed in a few jumps and the expected trial distribution makes use of
only some 3 or 4 stimulus levels. (Given a very large stimulus step, of
course, only 2 stimulus levels would be used).

A third fact is that the expected trial distribution is not simply the
de-cumulated probability of response curve. The expected trial distributions
happen to look approximately normal (as may be observed by plotting them on
normal probability paper), but they differ considerably among themselves in
variance. When the step size is less than 1.0σ, the variance of the
expected trial distribution is less than 1.0 on the scale of σ for the
cumulative normal. Similarly, when the step size is greater than 1.0σ, the
variance of the expected trial distribution is greater than 1.0 on the scale
of σ.
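
The findings above can be checked with a short Python sketch of the
normal-model computation. It is not the original procedure: probabilities
come from `math.erf` rather than a three-decimal table, and the names
`normal_trial_distribution` and `summarize`, the ±8σ span, and the reuse of
the .005 plotting floor as the criterion for a level being "used" are
assumptions made for illustration.

```python
import math


def phi(z):
    """Cumulative normal probability for a deviate z (in σ units)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_trial_distribution(step, midway=False, span=8.0):
    """Expected proportion of trials at each test level under the normal model.

    Levels are in σ units relative to μ. With midway=False one level
    coincides with μ; with midway=True μ lies halfway between two levels.
    """
    offset = step / 2.0 if midway else 0.0
    k_max = int(span / step) + 1
    levels = [offset + k * step for k in range(-k_max, k_max + 1)]
    p_down = [phi(x) for x in levels]   # P(response that lowers the level)

    centre = min(range(len(levels)), key=lambda i: abs(levels[i]))
    weights = [0.0] * len(levels)
    weights[centre] = 1.0
    for i in range(centre, len(levels) - 1):   # balance moves between i, i + 1
        weights[i + 1] = weights[i] * (1.0 - p_down[i]) / p_down[i + 1]
    for i in range(centre, 0, -1):             # balance moves between i, i - 1
        weights[i - 1] = weights[i] * p_down[i] / (1.0 - p_down[i - 1])

    total = sum(weights)
    return levels, [w / total for w in weights]


def summarize(step, midway=False, floor=0.005):
    """Number of levels carrying at least `floor` of the trials, and the
    variance of the trial distribution about μ (in σ² units)."""
    levels, props = normal_trial_distribution(step, midway)
    levels_used = sum(1 for p in props if p >= floor)
    variance = sum(p * x * x for x, p in zip(levels, props))
    return levels_used, variance


for step in (0.5, 1.0, 1.5, 2.0, 3.0):
    print(step, summarize(step, midway=False), summarize(step, midway=True))
```

Under these assumptions a 0.5σ step uses 7 levels when μ falls on a test
level and 6 when it falls midway, while a 3σ step uses 3 and 4 levels
respectively; the variance is below 1.0 for the 0.5σ step and above 1.0 for
the 2σ step.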

c. Implications. The foregoing facts have three implications for
several sections which follow below. These are:

(1) Re the estimation of the standard deviation of the probability
of response curve: The estimation formula will not be based simply on the
standard deviation of the obtained trial distribution, because the variability
of the latter is not invariant but depends in a critical way upon step size.

(2) Re the effect of skewness of the probability of response curve:
The smaller the step size, the greater the probability that the obtained
trial distribution for an up-and-down series will cluster around the median
and never get out into the tails of the probability of response curve. On
the other hand, the larger the step size, the greater the probability that
the entire probability function will be scanned, with some trials being
conducted at stimulus levels which mark the tails of the function. These
facts suggest that whatever the estimate is which is used for the mean-
median of the cumulative normal, it will require some special examination if
applied to a situation where the response function is skewed.

(3) Re the difficulty of the subject's differential discriminations:
If a small step size is used, the comparison stimulus will almost always
be very close to the point of subjective equality, so that the difficulty
of the series of trials will be very high. On the other hand, if a large
step is chosen, many of the judgments will be easy ones for the subject to
make because the comparison stimuli will, in large number, be quite far
removed from the PSE.

7. ESTIMATES OF μ AND σ OF THE PROBABILITY OF RESPONSE CURVE, BASED ON THE
NORMAL MODEL.

a. The estimates given by the Princeton Research Group. The following
estimates are described in a variety of sources: Anon. (1944), Anderson,
et al (1946), Dixon and Mood (1948), Dixon and Massey (1951, 1957).

μ and σ are estimated on the basis of the frequency distribution of the
responses X, or on the basis of the frequency distribution of the responses Y,
whichever response occurred less frequently in the up-and-down series.

Estimate of μ (i.e. the mean-median):

$$ m = \left(\text{mean stimulus level for the distribution of the less frequent response}\right) \pm \tfrac{1}{2}\left(\text{value of step}\right) $$

where:
   + applies if the distribution is for
     the responses "second stimulus softer"
   - applies if the distribution is for
     the responses "second stimulus louder"
     (i.e. the down-moving responses)

Estimate of σ:

$$ s = \frac{1.620}{\text{value of step in stim. units}} \left[ \left(\text{variance of the chosen distrib. in (stim. units)}^{2}\right) + .029\,(\text{value of step})^{2} \right] $$

or

$$ s = 1.620 \left(\text{value of step in stim. units}\right) \left[ \left(\text{variance of the chosen distrib. in (step units)}^{2}\right) + .029 \right] $$
A surprising feature of the estimate for σ is the fact that s depends
on the variance of the distribution of trials, not upon the square root of
that variance. Note, though, that the second form of the formula above
agrees in direction with the facts observed in Figure 2. When the step size
is small, the variance of the distribution in step units is large; and when
the step size is large, the variance in step units is small. It is
reasonable then that these two terms should enter the formula for s as a
product.

The development of this estimate is described by Dixon and Mood (1948) and
Anon. (1944). It is an approximate maximum likelihood estimate.

b. The calculation of m (as the PSE) and s (as a measure of the
differential threshold) for illustrative data of Figure 1. The preceding
formulae are applied in Table 2 to the calculation of m and s for the
loudness discrimination data of Figure 1. For these data, computations are
based upon the distribution of "S" responses. The symbols A and d are used
respectively for the arbitrary origin and for stimulus step. The value of m
turns out to be 87.36 db., meaning that the subject's bias was
(87.36 db - 90.00 db) or -2.64 db. The standard deviation of the
psychometric function is estimated as 1.17 db.
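
As a check on the arithmetic, the two Princeton estimates can be sketched in
Python. The function name `princeton_estimates` and its argument names are
hypothetical; the constants 1.620 and .029 and the data are those given in
the text.

```python
def princeton_estimates(freq_by_level, step, response_moves_up):
    """Estimates m and s from the distribution of the less frequent response.

    freq_by_level maps stimulus level -> number of times the less frequent
    response occurred there. response_moves_up is True when that response
    raises the next stimulus (e.g. "softer"), False when it lowers it.
    """
    n = sum(freq_by_level.values())
    mean = sum(level * f for level, f in freq_by_level.items()) / n
    variance = sum(f * (level - mean) ** 2
                   for level, f in freq_by_level.items()) / n

    # m: mean of the chosen distribution, plus or minus half a step.
    m = mean + (step / 2.0 if response_moves_up else -step / 2.0)
    # s: 1.620 * step * (variance in step units + .029).
    s = 1.620 * step * (variance / step ** 2 + 0.029)
    return m, s


# "Softer" judgments of Figure 1: 3 at 88 db, 7 at 86 db, 1 at 84 db.
m, s = princeton_estimates({88: 3, 86: 7, 84: 1}, step=2.0,
                           response_moves_up=True)
print(round(m, 2), round(s, 2))
```

For the illustrative data this prints 87.36 and 1.17, the values of m and s
quoted above.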

c. The effect on m and s of the position of μ relative to the chosen
test levels. The estimate of μ is not seriously affected by the position of
μ relative to the test levels, as long as the size of stimulus step is
less than 2.5σ. The estimate of σ, however, is much more sensitive to the
position of μ relative to test levels, and there is agreement that up-and-
down data do not provide as good an estimate of σ as they do of μ. For small
samples, in fact, s may be of little value. How useful it may be in
psychophysical research has been explored in some of the studies to be
described below.


Table 2

Calculation of m and s from the data of Figure 1.

Judgment distributions in Figure 1:

   Stim. level     L     S
       94          1
       92          1
       90          4
       88          7     3
       86          1     7
       84                1

1. Use arbitrary origin procedure and work with
   the distribution having the smaller number of
   events (the S-distribution here):

   Let arbitrary origin = A = 86
   Let deviations from A, in step units, be x'
   Let the step size be d: here d = 2

   Stimulus level     x'     f     fx'    f(x')²
       88             +1     3      3       3
       86 = A          0     7      0       0
       84             -1     1     -1       1

       Σf = N = 11        Σfx' = 2        Σf(x')² = 4

2. Compute m from the formula:

   m = (mean of chosen distrib.) ± ½ d

$$ m = A + d\left(\frac{\sum f x'}{N} \pm \frac{1}{2}\right) = 86 + 2\left(\frac{2}{11} + \frac{1}{2}\right) = 87.36 $$

3. Compute s from the formula:

$$ s = \frac{1.620}{d}\left[\text{Variance of chosen distrib.} + .029\,d^{2}\right] $$

or

$$ s = 1.620\,d\left[\frac{N\sum f(x')^{2} - \left(\sum f x'\right)^{2}}{N^{2}} + .029\right] = 1.620\,(2)\left[\frac{(11)(4) - (2)^{2}}{(11)^{2}} + .029\right] = 1.17 $$

8. THE ADEQUACY OF m AS AN ESTIMATE OF μ WHEN THE PSYCHOMETRIC FUNCTION
IS NORMAL.

We are concerned here, as we are in relation to any estimate, with the
twin problems of bias and reliability.

If an up-and-down series were to consist of a very large number of
trials, a very natural estimate to take for μ would be the mean of the
stimulus levels used, i.e. the mean of the total trial distribution. It must
be noted, however, that the first trial is given at a level which is chosen
by the experimenter and has nothing to do with the subject's discrimination.

And further, if the experimenter should happen to start the up-and-down
series at a stimulus level quite far removed from the subject's PSE, the
entire first portion of the series would consist of a succession of like
responses which would merely serve to bring the experimenter and the subject
into the general vicinity of the PSE. This means that when the up-and-down
series consists not of a large number but rather a small number of trials,
the simple average of the total trial distribution could be a considerably
biased estimate of μ.

One way to think about this matter is in terms of the "expected
stimulus level" for each of the early trials of the series. Given a large
number of up-and-down series all of which start at a particular stimulus
level which is fairly distant from the PSE, the subject's response
probabilities at that stimulus level dictate that some of these series will
move up for the next trial and some will move down. From this probability
information, one can compute the "average" stimulus level at which trial 2
will be conducted. Knowing the stimulus levels which might be used on trial 2
and knowing the response probabilities at each of these levels, one can
compute the average or expected stimulus level for trial 3. The computations
become more elaborate with each passing trial, but continuing them one may
find the expected stimulus level for each of the early trials of the series.
One finds, by way of illustration, that if the starting level is 4σ below
the PSE and the stimulus step is 1σ, it is not until the tenth trial that
the expected stimulus level is within .01σ of the PSE. Similarly, if the
starting level is 4σ below the PSE and the stimulus step is 2σ, the expected
stimulus level comes within .01σ of the PSE by the fifth trial of the series.
It is the fact that these early expected stimulus levels deviate from μ
which would make the average of the total trial distribution a biased
estimate of μ.
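
The trial-by-trial computation just described can be sketched as follows.
The sketch assumes the normal model with the PSE at zero and levels measured
in σ units; the function names `expected_levels` and `first_trial_within`
are hypothetical, and probabilities come from `math.erf` rather than from a
printed table.

```python
import math
from collections import defaultdict


def phi(z):
    """Cumulative normal probability for a deviate z (in σ units)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_levels(start, step, n_trials):
    """Expected stimulus level (σ units about the PSE) on trials 1..n_trials."""
    dist = {0: 1.0}   # key: net number of upward steps taken from `start`
    means = []
    for _ in range(n_trials):
        means.append(sum(p * (start + k * step) for k, p in dist.items()))
        nxt = defaultdict(float)
        for k, p in dist.items():
            p_down = phi(start + k * step)   # response that lowers the level
            nxt[k - 1] += p * p_down
            nxt[k + 1] += p * (1.0 - p_down)
        dist = nxt
    return means


def first_trial_within(start, step, tol=0.01, max_trials=50):
    """First trial (1-based) whose expected level lies within tol of the PSE."""
    for trial, mean in enumerate(expected_levels(start, step, max_trials), 1):
        if abs(mean) < tol:
            return trial
    return None


print(first_trial_within(start=-4.0, step=1.0))
print(first_trial_within(start=-4.0, step=2.0))
```

Under these assumptions the sketch returns trial 10 for a 1σ step and trial 5
for a 2σ step, in agreement with the illustration in the text.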

a. Ways of minimizing bias from early trials. Clearly we wish an
estimate of μ which is as unbiased as we can make it. To this end, several
procedures have been suggested:

(1) Start the up-and-down series near μ. This is not always
possible, particularly in psychological experiments dealing with judgment
bias. For such experiments an important advantage of the up-and-down method
is that the trial series will hunt for μ, no matter where the series happens
to have been started.

(2) Consider the first run of like responses to be a preliminary
series of trials which brings the series close to μ. Drop this run from
the record and estimate μ on the basis of trials which follow the first
"turn-around" or reversal in the up-and-down series.

(3) Consider as many as the first three runs of like responses to
be preliminary, and base the estimate of μ on the distribution of trials
following, say, the third turn-around.

(4) Consider some fixed number of initial trials as preliminary
and drop these from the record. If step size is not too small and if the
trial series starts at a level not too far from μ, a reasonable number of
initial trials to drop is 5.

(5) Deal only with the trial distribution for the less frequent
response, not the total trial distribution. This, of course, is the
Princeton procedure cited above. Because the trial distributions for the
more frequent and the less frequent responses differ only to the extent
that the up-and-down series does not return to the level from which it
starts, this procedure has the effect of removing early trials from the
computation. Unless the number of trials is large, however, this estimate
is still somewhat biased toward the initial testing level (See Dixon and
Massey, 1957).

(6) In conjunction with schemes (2) through (5), use double- or
triple-size stimulus steps during the trials designated as preliminary.
With such steps, the expected stimulus level on successive trials approaches
μ more quickly, turn-arounds occur sooner, and the experimenter obtains
early information that the series has bracketed μ.

In the work of our project, we frequently followed procedures (3) and
(4), but whether we did or not for a particular set of data we always used
the Princeton estimate, m, for μ (procedure 5 above). The particular
programming equipment which we used for the major portion of our work did
not permit the adoption of procedure (6), although its use would have been
highly desirable. Hopefully, our combined procedures minimized bias.
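
Procedures (2) through (4) amount to trimming the front of the trial record.
A minimal Python sketch follows. The function name `first_trial_after_runs`
is hypothetical, the response series is invented for illustration, and the
sketch assumes that the trial on which the response first changes is the
first trial retained; the text does not settle that detail.

```python
def first_trial_after_runs(responses, n_runs=1):
    """Index of the first trial following the first n_runs runs of like responses.

    responses is the sequence of judgments in trial order, e.g. "LLLSLSSL".
    Returns len(responses) if the series has fewer than n_runs reversals.
    """
    reversals = 0
    for i in range(1, len(responses)):
        if responses[i] != responses[i - 1]:
            reversals += 1
            if reversals == n_runs:
                return i
    return len(responses)


series = "LLLSLSSLSL"   # invented series: L = "louder", S = "softer"
kept_after_first_run = series[first_trial_after_runs(series, 1):]   # procedure (2)
kept_after_third_run = series[first_trial_after_runs(series, 3):]   # procedure (3)
kept_after_five_trials = series[5:]                                 # procedure (4)
print(kept_after_first_run, kept_after_third_run, kept_after_five_trials)
```

The retained trials would then be passed to whatever estimate of μ is in use,
for example the Princeton estimate m of procedure (5).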

For a discussion of this bias problem in relation to the efficiency of
the estimate for μ, the reader is referred to Brownlee et al (1953).

b. The standard error of m. The standard error of m, as we would expect,
decreases with an increase in the number of trials in the up-and-down series
and increases with an increase in the value of s. It also increases with
step size (Anon., 1944; Dixon and Mood, 1948). The Princeton formula is:

Estimate of σ_m:

$$ s_m = \frac{s\,G}{\sqrt{N_c}} $$

where
   N_c is the number of trials in the trial
       distribution for the less frequent response.
   G   is the following function of step size.

       Step size:   0.5σ    1.0σ    1.5σ    2.0σ    2.5σ
       G:           0.94    1.00    1.07    1.15    1.18

Although there was originally some concern that this formula might not
provide a satisfactory estimate of the reliability of m, unless the number
of trials was of the order of 40 or 50, analyses by Brownlee et al (1953)
have shown that this formula is reasonably dependable even when the up-and-
down series is very short in length. Our own examination of the variability
of values of m for replicated tests using the up-and-down method also
confirms the general usefulness of the foregoing estimate of σ_m when the
series is of the order of 20 to 30 trials in length, although it appears
to overestimate slightly (see pages 40-42 below).

To the psychologist interested in conducting experiments using relatively
short up-and-down series, it may be helpful to have a tabulation of values of
σ_m as a function of the number of trials in the series and the size of the
stimulus step. Such a table is presented as Table 3. Note that N_c in
the formula for σ_m is the number of trials in the distribution for the less
frequent response. N_tot. given in Table 3 was taken simply as 2N_c.
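
A tabulation of this kind can be sketched in Python from the Princeton
formula. The names `G_BY_STEP` and `standard_error_of_m` are hypothetical,
and because G is taken here only to the two decimals printed in the text,
the results may differ from a published table by about .01.

```python
import math

# G as tabulated in the text for each step size (step size in σ units).
G_BY_STEP = {0.5: 0.94, 1.0: 1.00, 1.5: 1.07, 2.0: 1.15, 2.5: 1.18}


def standard_error_of_m(sigma, step_in_sigma, n_total):
    """σ_m = σ G / sqrt(N_c), taking N_c as half the total number of trials."""
    n_c = n_total / 2.0
    return sigma * G_BY_STEP[step_in_sigma] / math.sqrt(n_c)


for step in (2.5, 2.0, 1.5, 1.0, 0.5):
    row = [standard_error_of_m(1.0, step, n) for n in (10, 20, 30)]
    # Each entry is given both in σ units and in units of the stimulus step.
    print(step, [f"{se:.2f}σ or {se / step:.2f} steps" for se in row])
```

Setting sigma to 1.0 expresses each entry in σ units; dividing by the step
size re-expresses it in stimulus steps.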

In summary then, we find that regardless of how poor the experimenter's
choice of initial starting level for the up-and-down series, it is possible
to obtain values of m which are unbiased estimates of μ. Given that m is
unbiased, and given normality and stability of the probability of response
function, the standard error of m is reasonably well represented by the
Princeton formula.

c. Comparison of the reliability of m with that of the estimate of μ
obtained by the method of constant stimuli. It has been shown by Dixon and
Mood (1948) as well as by Brownlee et al (1953) that if an estimate of μ is
to have a given reliability, this can be achieved with about 30% fewer trials
using the up-and-down method than by using the method of constant stimuli.
This comparison, which is based on the use of the same stimulus steps by
the two psychophysical procedures, is of interest in relation to empirical
comparisons of the two methods which will be described below.

9. THE MEANING OF m WHEN THE PSYCHOMETRIC FUNCTION IS SKEWED.

In the case where the probability of response curve is normal, the mean
and the median of the function coincide. When the function is skewed the
mean, μ, moves away from the median in the direction of the longer tail of
the distribution. Now it has already been pointed out above that if step
size is small, all trials will cluster very close to the median. Conversely,
trials will jump out to test levels represented in the tails of the
distribution only if step size is large. The consequence of these relations
is that m falls between the median and μ in a skewed distribution.
Furthermore, it resembles the median more closely if the step size is small
and resembles μ more closely if the step size is large.

Confirming computations of the expected value of m for four distributions
of differing degrees of skewness placed this value between μ and the median
in all cases. And when the step size was larger, i.e. 2σ, the expected
value of m was always closer to the mean of the probability of response
curve than when the step size was smaller, i.e. only 1σ. See Figure 3 for
details. For purposes of these calculations, skewed distributions were
devised as composites of two normal distributions. The relative weights
given to the component distributions varied, as did the relative size of
their variances and the displacement of their means. No specific meaning
is to be attached to the scale units in which the distributions are drawn.
Stimulus step sizes are given on the same arbitrary scale.


Table 3

VALUES OF σ_m AS A FUNCTION OF STEP SIZE AND
NUMBER OF TRIALS, N_tot., IN THE UP-AND-DOWN SERIES,
assuming normal model and that
initial test level is near μ.

Values of σ_m computed from:

$$ \sigma_m = \frac{\sigma\,G}{\sqrt{N_c}} $$

   Step      Number of trials in the up-and-down series, N_tot.
   size:     10                    20                    30

   2.5σ      .53σ or .21 steps     .38σ or .15 steps     .31σ or .12 steps
   2.0σ      .52σ or .26 steps     .37σ or .18 steps     .30σ or .15 steps
   1.5σ      .48σ or .32 steps     .34σ or .23 steps     .28σ or .18 steps
   1.0σ      .45σ or .45 steps     .32σ or .32 steps     .26σ or .26 steps
   0.5σ      .42σ or .84 steps     .30σ or .59 steps     .25σ or .49 steps


[Figure 3 (plots not reproduced).]

Figure 3: The effect of skewness of the psychometric function
upon the meaning of m. The data for the above distributions are given in
the table below. All distributions have been plotted above with a common
median of 10. The mean for each is shown by the vertical slash.

   Distribution:     Normal    Dist. A    Dist. B    Dist. C    Dist. D

   Median:           10.0      10.0       10.0       10.0       10.0
   m when d=1:       10.0      9.94       9.92       9.87       9.89
   m when d=2:       10.0      9.92       9.86       9.79       9.78
   m when d=4:       10.0                            9.74       9.51
   Mean:             10.0      9.90       9.80       9.65       9.40

It is clear that the up-and-down method will be applied by psychologists
in many measurement situations where in fact the probability function
is skewed. In these cases, m will never be far from the "middle" of the
distribution, but its expected value will never be either μ or the median
of the psychometric function. That this will ever be of critical importance
is doubted. Such importance as it may have, however, will depend upon the
degree of skewness (i.e. on the difference between μ and the median) and
also upon the homogeneity of the sets of data being compared. If all data
are for similarly skewed functions, for example, differences between several
obtained values of m should still be valid estimates of differences between
the corresponding values of μ, or of the medians.

It  may  be  noted  that  one  study  of  the  effect  of  non-normality  on  the 
estimates,  m,  has  been  reported  in  the  literature  (see  Votaw,  1948)  but 
this  paper  deals  with  response  functions  other  than  those  with  simple  skew¬ 
ness. 

10. PREFERRED STEP SIZE, IN σ UNITS.

The importance of step size for the effectiveness of up-and-down testing
has already become apparent from preceding sections. We would like a step
size, d, which will provide the best possible estimate of μ as well as a
good estimate of σ. For psychological experiments, we also want it to be
true that the chosen size of step will favor good motivation on the part of
our subjects. If compromise on step size is necessary, the following facts
are relevant:

a. Reasons for avoiding a step size which is too small.

(1) Too small a step size leads to a waste of many early trials
if that step size is used from the very beginning of the series and the
experimenter happens to start at a level which is some distance from μ.
(See section 8 above).

(2) A step size smaller than 1σ leads to a less reliable estimate
of σ than does a step size between 1σ and 2σ. (See Anon., 1944;
Dixon and Mood, 1948).

(3) A small step size keeps all trials very near the 50-50 point
on the psychometric function. (See section 6 and Figure 2 above). For
differential judgments in particular, this means that the subject never
has any easy trials. Motivation may fall if he feels that he is guessing
on every trial. Note that with a step size of 0.5σ, 70% of all trials are
conducted within 0.5σ of μ, and essentially no trials are expected to
occur at "easy" levels as far as 2σ from μ. With a step size of 2σ, however,
only about 25% of all trials are conducted within 0.5σ of μ, while some
33% of all trials are conducted in the "easy" range, 2σ or more from μ.
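
The percentages in (3) can be approximated with the sketch below. It
assumes, as one reading of the text, that each figure is the average of the
two cases of Figure 2 (μ on a test level, and μ midway between two levels);
that averaging, the function names, and the use of `math.erf` are
illustrative assumptions rather than the documented source of the numbers.

```python
import math


def phi(z):
    """Cumulative normal probability for a deviate z (in σ units)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def trial_proportions(step, offset, n_side=12):
    """Expected proportions of trials at levels offset + k*step (σ about μ)."""
    weights = {0: 1.0}
    for k in range(0, n_side):        # upward: balance level k against k + 1
        up = 1.0 - phi(offset + k * step)
        weights[k + 1] = weights[k] * up / phi(offset + (k + 1) * step)
    for k in range(0, -n_side, -1):   # downward: balance level k against k - 1
        down = phi(offset + k * step)
        weights[k - 1] = weights[k] * down / (1.0 - phi(offset + (k - 1) * step))
    total = sum(weights.values())
    return [(offset + k * step, w / total) for k, w in weights.items()]


def averaged_share(step, keep):
    """Share of trials whose distance from μ satisfies keep(distance),
    averaged over μ on a test level and μ midway between two levels."""
    shares = []
    for offset in (0.0, step / 2.0):
        props = trial_proportions(step, offset)
        shares.append(sum(p for x, p in props if keep(abs(x))))
    return sum(shares) / len(shares)


for step in (0.5, 2.0):
    near = averaged_share(step, lambda a: a <= 0.5)
    easy = averaged_share(step, lambda a: a >= 2.0)
    print(f"step {step}σ: {near:.2f} within 0.5σ of μ, {easy:.2f} at 2σ or more")
```

Under these assumptions the shares come out close to the 70%, 25%, and 33%
quoted in the text, with almost no trials at 2σ or beyond for the 0.5σ step.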

b. Reasons for avoiding a step size which is too large.

(1) A step size larger than 2σ makes it necessary to follow a
more complicated procedure to estimate σ than that given above. (See
Anon., 1944; Dixon and Mood, 1948). As Dixon and Massey (1957) put it,
the above estimate of σ is quite accurate as long as the variance of the
chosen distribution is larger than 0.3 when computed from data in stimulus
steps, but breaks down when the variance becomes less than 0.3 (i.e. s<.5).

(2) A step size larger than 2σ leads to a less reliable estimate
of σ than does a step size between 1σ and 2σ. (See Anon., 1944; Dixon
and Mood, 1948).

(3) For increasingly large step sizes, the value of G in the
formula for σ_m grows larger and larger, and hence the value of s_m
increases. It is helpful to know, however, that over the range of step
sizes from 1σ to 2σ, G increases by only a small amount, namely about 15%.
(See section 8 above).

(4) For step sizes larger than 2.5σ, G depends upon an unknown
value, namely the amount by which μ departs from the testing level nearest
to it. This means that for very large steps, s_m is never well determined.
In fact, trials using exceedingly large steps may bracket μ repeatedly
without providing any information at all about the intervening position of μ.

c. The preferred step size: between 1σ and 2σ. The facts just
presented point to the use of a step between 1σ and 2σ. The general
recommendation of those who made early evaluations of the method was to use
a step as close to 1σ as possible. For one-trial-per-subject experiments
(see discussion below) this recommendation stands. For prolonged
differential discrimination tasks with the same subject, however, it seems
quite essential that we ease the subject's task by giving him a reasonable
number of easy judgments. This means using stimulus steps as large as can be
tolerated by other considerations. Here we suggest a step size close to 2σ,
say at least 1.5σ, but not exceeding 2σ. The gain in measurement reliability
which comes from better subject motivation is not easily quantified, but it
probably exceeds the small loss in the reliability of m which comes with the
use of steps which are 2σ in size rather than 1σ.

11.  METHODS  FOR  ESTABLISHING  A  STIMULUS  STEP  OF  THE  PREFERRED  SIZE. 

Knowing how large a step should be in σ units still leaves us with the
matter of determining how large the step should be in stimulus units. Two
procedures are useful here.

The first and obvious procedure for choosing d is on the basis of an
estimate of σ: (a) Run a preliminary series of up-and-down observations using
any step thought to be suitable. (b) Compute s for this series. (c) Run
a new series with step size between 1.5s and 2s. (d) Compute s for this
new series and see if it agrees with the first estimate, etc. From a
number of such series the value of σ can be approximated and the stimulus
step chosen accordingly.

The other procedure is cruder and may require more data, but has merits
of its own as a simple way of monitoring the chosen value of d throughout
the experiment. This procedure does not depend upon the calculation of s
for each series, but is based instead on the range of test levels used in
typical up-and-down series, a strategy suggested by the distributions in
Figure 2, page 15. Suppose that over a number of up-and-down series, no
series ever requires the use of more than 3 stimulus levels. The implication
from Figure 2 would be that the step size is too large, probably 3σ
or more. On the other hand, if as many as 6 stimulus levels are used in
series after series, the implication would be that the step size is too
small, probably in the range 0.5σ to 1.0σ. These observations led us (Guth
and Kappauf) to conduct an empirical sampling study to determine how many
stimulus levels would be used with what relative frequency if the stimulus
step were 1.5σ to 2σ in size. Our finding was that for moderately short
up-and-down series, most series would involve 3 or 4 levels, only a few
would involve 5 levels, and none would be expected to extend over as many
as 6 levels. The specific data obtained for these and other step sizes are
shown in Table 4. The information given there may be taken as a guide
while one is making the original selection of stimulus step size for an
experiment, but perhaps more importantly, can be used to monitor one's
choice of stimulus step while the experiment is under way. Changes in the
number of test levels used as the experiment proceeds may reflect practice
effects, changing test conditions, etc. and point to the desirability of
revising the value of d for testing under these new conditions. The general
rule which emerges from the table is that our stimulus step is in the range
of 1.5σ to 2σ if, starting at an initial test level near μ, we find the
frequent use of 4 stimulus levels, the infrequent use of 5, and no use of
6.
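
A sampling study of this kind is easy to repeat by machine. The sketch below is an illustration only: it assumes the normal model with μ = 0 and σ = 1, draws pseudo-random numbers in place of a table of random normal deviates, and starts every simulated series afresh at the chosen starting level (the study reported in Table 4 instead subdivided long continuing series). The function names and the number of simulated series are hypothetical choices.

```python
import random
from collections import Counter
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF; mu = 0, sigma = 1 assumed


def levels_used(step, n_trials, offset, rng):
    """Count the distinct test levels visited in one simulated series.

    step   : step size in sigma units
    offset : starting level minus mu, in sigma units
             (0 puts mu on a test level, step / 2 puts it midway)
    """
    k = 0              # position in whole steps from the starting level
    seen = set()
    for _ in range(n_trials):
        seen.add(k)
        x = offset + k * step
        # Response "greater" with probability PHI(x): step down next time.
        if rng.random() < PHI(x):
            k -= 1
        else:
            k += 1
    return len(seen)


def level_distribution(step, n_trials, offset=0.0, n_series=2000, seed=1):
    """Relative frequency of each number of levels used."""
    rng = random.Random(seed)
    counts = Counter(levels_used(step, n_trials, offset, rng)
                     for _ in range(n_series))
    return {k: counts[k] / n_series for k in sorted(counts)}


if __name__ == "__main__":
    for step in (3.0, 2.0, 1.5, 1.0, 0.5):
        print(step, level_distribution(step, 20))
```

Comparing the counts of levels used in one's own series with distributions of this sort gives the same kind of running check on d that the table is meant to provide.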

(One side comment is in order here. The mere fact that testing over
a series of stimulus levels results in a see-sawing up-and-down series is
in itself no proof that behavior is changing with stimulus level or that
the probability of response function over these levels ranges from .00 to
1.00. Suppose that the probability of response X and the probability of
response Y are each .50 at every stimulus level. The up-and-down series
then becomes an illustration of the mathematician's random-walk problem.
Empirical sampling under these conditions indicates that 15-trial series
will range on the average over 6.0 test levels, 20-trial series over 7.3
test levels, and 30-trial series over 8.5 test levels. Such series might
appear to be bracketing a 50-50 point, but of course they are not. The
use of a large number of test levels may thus mean, in certain untried test
situations at least, that there is nothing to measure, no 50-50 "point" to
locate.)
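
The random-walk case can be checked with a few lines of simulation. The sketch below is illustrative: the function names, the seed, and the number of simulated series are assumptions, and a series of n trials is taken to involve n - 1 moves between levels.

```python
import random


def random_walk_levels(n_trials, rng):
    """Distinct levels visited when every response is a 50-50 coin flip."""
    k, seen = 0, {0}
    for _ in range(n_trials - 1):      # n trials give n - 1 moves
        k += 1 if rng.random() < 0.5 else -1
        seen.add(k)
    return len(seen)


def mean_levels(n_trials, n_series=5000, seed=7):
    """Average number of levels used over many simulated series."""
    rng = random.Random(seed)
    total = sum(random_walk_levels(n_trials, rng) for _ in range(n_series))
    return total / n_series


if __name__ == "__main__":
    for n in (15, 20, 30):
        print(n, mean_levels(n))
```

The averages produced this way should fall close to the figures quoted above, and they grow steadily with series length, in contrast to series which are genuinely hunting about a 50-50 point.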

12.  THE REQUIREMENT OF INDEPENDENCE OF SUCCESSIVE TRIALS, AND WAYS OF MEETING IT, IN WHOLE OR IN PART.

Thus far in our discussion, little attention has been given to the
fact that the model for the up-and-down method presumes that the probability
of response curve is unchanging from trial to trial. The probability
of each of the alternative responses at a given stimulus level is presumed
to be the same regardless of the nature of the previous trial and the
response which was made on that trial. This is to say that successive
trials are assumed to be independent. All the computations of expected
trial distributions were based upon this premise.

Table 4

Results of empirical sampling study: probability of using a given number of test
levels in an up-and-down series as a function of step size and number of trials
(N_tot), assuming normal model and that initial test level is near μ.

                          μ exactly at a test level,    μ midway between test levels,
          Number of       and N_tot (the number of      and N_tot (the number of
Step      test levels     trials in the series) is      trials in the series) is
size      used               15     20     25     30        15     20     25     30

3.0σ      3 or less        1.00   1.00   1.00   1.00       .88    .87    .79    .75
          4                   -      -      -      -       .12    .13    .21    .25
          5                   -      -      -      -         -      -      -      -
          6 or more           -      -      -      -         -      -      -      -
          (Mostly 3's, no 5's)

2.0σ      3 or less         .85    .80    .75    .75       .62    .43    .42    .20
          4                 .13    .13    .17    .15       .38    .57    .58    .80
          5                 .02    .07    .08    .10         -      -      -      -
          6 or more           -      -      -      -         -      -      -      -
          (Frequent 4's, some 5's)

1.5σ      3 or less         .48    .37    .38    .30       .42    .20    .25    .15
          4                 .50    .60    .50    .55       .58    .80    .75    .85
          5                 .02    .03    .12    .15         -      -      -      -
          6 or more           -      -      -      -         -      -      -      -
          (Very frequent 4's, some 5's)

1.0σ      3 or less         .25    .20    .17    .10       .20    .10    .00    .00
          4                 .68    .53    .50    .45       .60    .57    .58    .50
          5                 .08    .27    .33    .45       .20    .30    .38    .45
          6 or more           -      -      -      -       .00    .03    .04    .05
          (Frequent 5's, some 6's)

0.5σ      3 or less         .05    .03    .00    .00       .05    .00    .00    .00
          4                 .55    .50    .33    .25       .55    .57    .38    .20
          5                 .30    .33    .38    .35       .25    .23    .38    .45
          6 or more         .10    .13    .29    .40       .15    .20    .25    .35
          (Frequent 6's)

Note: The probabilities cited in this table were determined from 10 up-and-down series
of 600 "trials" each, where the outcome of each trial was determined by consulting a
table of random normal deviates. The 10 series involved the 5 step sizes and the 2
locations of μ. Each series of 600 trials was then subdivided into sub-series of 15,
20, 25 and 30 trials on which counts of the number of levels used were made. Thus, the
probabilities for series of length 15 were based on 40 sub-series, those for series of
length 20 on 30, of length 25 on 24, and of length 30 on 20.

a. The up-and-down series based on one trial per subject. One way of
assuring the independence of successive trials is to conduct each trial on
a different individual. In this case, the probability of response curve
applies to the population of individuals, and the variability of this
curve reflects both intra- and inter-individual variability. The procedure
is nevertheless effective for some studies, and we have used it in work
on taste preferences in animals as well as in the experiment on the time
error with human subjects. These applications will be discussed below.

b. Concurrent up-and-down series with same subject. Most often, however,
the object of a psychophysical experiment is to quantify the discrimination
of individual subjects. To this end the same subject must be
tested over many trials. Under these circumstances there are at least
three forms of trial dependencies which could occur.

The  first  of  these  is  dependency  associated  with  the  direct  effect  of 
the  immediately  preceding  trial(s)  upon  each  current  judgment.  This,  of 
course,  is  a  matter  of  experimental  interest  in  its  own  right,  and  is  THE 
matter  of  experimental  interest  for  the  context  studies  reported  later. 

The second form of possible trial-to-trial dependency is that associated
with limen drift, limen fluctuation, criterion fluctuation, etc., during
the course of the experiment (see Day, 1951; Verplanck et al., 1952; Kappauf
and Payne, 1954, 1955). If such drifts or fluctuations occur, we have a
situation where μ is really changing and not remaining fixed. Successive
trials are not completely independent because they sample the subject's
behavior at moments when μ has varied little if at all, whereas more widely
separated trials occur at times when μ may be different.

The third form of trial-to-trial dependency is unique to sequential
testing methods and has long been a subject of discussion with regard to
the method of limits (see e.g. Titchener, 1905; Urban, 1908). It concerns
the fact that if the subject is aware of the sequential order of trials,
his expectation or set will change as the series of trials proceeds. For
the up-and-down method, in particular, a subject might well "see through"
or "discover" the experimenter's up-and-down program of trials. Once he
does so, he might reduce his judgments on successive trials to a simple
alternation between response X and response Y. Clearly successive trials
would no longer be independent.

There are several ways in which we should be able to conceal the
up-and-down trial sequence from our subject, and hence deal with this third
source of trial dependency. One way would be to intersperse one or more
"dummy" trials between successive "regular" trials of the up-and-down series.
(This corresponds to Fernberger's procedure (1913) for obscuring the serial
order of trials in the method of limits.) Another way would be to arrange
the interspersed trials as trials on other up-and-down series. The latter
plan means that we conduct the experiment with a number of concurrent
up-and-down series. Successive trials in the experiment are then randomly
chosen from the different series. Experience indicates that this completely
conceals the basic testing routine from the subject. In our experiments,
the number of concurrent series used in tests with a single subject has
ranged from 3 to as many as 12.
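
The scheduling of concurrent series can be sketched as follows. This is a hypothetical illustration and not a description of the apparatus or schedules actually used: the function `run_concurrent`, the `respond` hook standing in for the subject, and the rule of drawing the next trial uniformly from the series still unfinished are all assumptions.

```python
import random


def run_concurrent(start_levels, trials_per_series, respond, step=1, seed=None):
    """Interleave several up-and-down series in random order.

    start_levels : initial test level for each series
    respond      : callable(series_index, level) returning True when the
                   response calls for a lower level on that series' next trial
    Returns (order, records): the series index presented on each trial, and
    the (level, response) history of each series.
    """
    rng = random.Random(seed)
    current = list(start_levels)
    records = [[] for _ in start_levels]
    order = []
    open_series = list(range(len(start_levels)))
    while open_series:
        i = rng.choice(open_series)        # next trial drawn at random
        level = current[i]
        r = bool(respond(i, level))
        records[i].append((level, r))
        order.append(i)
        # Each series follows its own up-and-down rule, ignoring the others.
        current[i] = level - step if r else level + step
        if len(records[i]) == trials_per_series:
            open_series.remove(i)
    return order, records


if __name__ == "__main__":
    # Toy "subject": always responds "greater" above level 0.
    order, records = run_concurrent([0, 2, -1], 10, lambda i, x: x > 0, seed=3)
    print(order)
    print(records[0])
```

Because the subject meets the trials in the shuffled `order`, the level on one trial carries little information about the level on the next, while each series can still be analysed separately afterwards.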

Suppose that several concurrent series are run all using the same
standard. Here one would anticipate that the effects of preceding trials
would be random, and hence equivalent within random error, for each of the
several series. Limen drift or criterion drift, which ought to affect all
series about equally, should tend to increase s for each series but stabilize
the values of m for the several series. Thus the values of m may differ
somewhat less from one another than would be expected on the basis of some
composite estimate, say s̄, of σ. This problem will be examined in more
detail in section 14 below with regard to some loudness discrimination data.

The use of several concurrent up-and-down series during continued
observation by a given subject is thus thought to eliminate the potentially
most troublesome source of trial-to-trial dependency but leaves two others.
Both of the latter will be the subject of study below.

13.  UP-AND-DOWN TESTING BASED ON ONE TRIAL PER SUBJECT: RESULTS OF FOUR EXPERIMENTS AND OBSERVATIONS ON THE VARIABILITY OF m OVER REPLICATIONS.

The one-trial-per-subject procedure is of interest because it guarantees
independence of successive trials in the up-and-down series, and also because
it rules out bias which, under a massed-trials-with-the-same-subject procedure,
may arise from the subject's continued experience in the test
situation. Such bias is found in context effects from comparison stimuli, in
effects associated with changes in motivation level with continued testing, etc.

We first used the one-trial-per-subject procedure in an early project
experiment where it was our interest to measure the time error under conditions
which would be as free as possible from context effects. Subsequently
the procedure was used in a series of non-project experiments designed to
map families of taste preference functions (isohedons) for the white rat.
Here our concern was to limit the quantity of incentive fluids ingested by
any one animal in order that the obtained preference functions would be
associated with taste preferences as such, uncontaminated by the effects of
post-ingestional factors.

Clearly,  the  one-trial-per-subject  procedure  can  be  efficient  if  and 
only  if  the  instruction  time  or  "readying  time"  per  subject  is  not  long. 
This  condition  was  met  in  the  applications  to  be  discussed  here. 

a. Experiment T-1: The effect of inter-stimulus interval on the magnitude
of the time error, measured under conditions of minimum context.

(1) Introduction: The "time error" (or the "time order error")
is the name given to that bias in differential judgment which arises as a
function of the fact that the two stimuli to be compared are separated in
time. Most often the time error proves to be negative, that is, the second
stimulus need not be as intense as the first in order to appear equal in
intensity to the first. Under some conditions of stimulus context or of
inter-stimulus interval, however, the error proves to be positive, that is,
the second stimulus needs to be more intense than the first if the two are
to appear equal in intensity. Problems which have attracted interest with
regard to the time error in differential judgment include the following:
(1) How does the time error vary in magnitude as a function of inter-stimulus
interval? (See Kohler, 1923; Needham, 1934, 1935; Koester, 1944-45.)
(2) How general is the time error? For what judgment situations is
the time error present, for which ones absent? (e.g. Postman, 1946;
Stevens, 1957). (3) How does the time error for judgments with a given
standard vary as a function of stimulus context, i.e. as a function of
having trials with other standards scheduled in the same session? (e.g.
Woodrow, 1933; Needham, 1935; Koester and Schoenfeld, 1946). (4) How may
one best explain the time error and the manner of its variation with situation
variables? (Kohler, 1923; Pratt, 1933; Woodrow, 1933; Michaels and
Helson, 1954).

In general the time error has been a very variable phenomenon,
not only variable from individual to individual (Needham, 1934) but especially
variable as a function of experience in the test situation (Kohler,
1923; Needham, 1934; Stott, 1935). Continued experience, however, has
always meant continued exposure to the same set of comparison stimuli as
called for by the method of constant stimuli. It would appear then that
context factors may have been important in determining the time errors
displayed by experienced subjects. One time error function which is
reported to be very sensitive to experience effects is the so-called
"p-function," i.e. the function relating the time error to inter-stimulus
interval (Kohler, 1923; Needham, 1934). Because this function has been
central in some theoretical discussions of the time error, it appeared of
interest to determine the effect of inter-stimulus interval on the time
error for completely naive subjects, that is, subjects who are tested for
only one trial each.

In using the up-and-down method in this study, we gained two advantages.
(a) We were able to specify the magnitude of the time error in
stimulus units. Previous work had had to rely on less direct measures such as
the relative proportion or preponderance of "greater-than" and "less-than"
judgments (i.e. the D-X measure). (b) We were able to evaluate our results
using significance tests based upon computed values of s and s_m.

(2) Purpose: to determine, for several discrimination tasks,
the magnitude of the time error as a function of inter-stimulus interval
under conditions of minimum context, i.e. where each subject makes only
one judgment per discrimination task.

(3) Discrimination tasks: Each subject made three differential
judgments in this order: a judgment of the difference between two weights,
one of the difference between two auditory durations, and one of the difference
between two pressures. The stimuli employed were the following:

Lifted weights: The subject lifted 50 cc. Erlenmeyer flasks filled with
cotton and lead shot. The standard was 100 grams. The comparison stimuli
were in the logarithmic series 71, 79, 89, 100, 112, 126, etc. grams.

Duration discrimination: The stimulus employed was a buzzer controlled in
its duration by a Stoelting timer. The standard was of 3 seconds duration.
The comparison stimuli were in the logarithmic series 2.13, 2.25, 2.38,
2.52, 2.67, 2.83, 3.00, 3.18, 3.37, 3.57, 3.78, 4.00, etc. seconds.

Pressure discrimination: The subject depressed a key with his forefinger.
The key was on one end of a weighted lever and the subject could experience
the pressure by depressing the key a distance of one-eighth inch. The
pressure was 50 grams for the standard, and was 28.1, 31.6, 35.4, 39.7,
44.5, 50.0, 56.2, 62.8, 70.7, etc. grams for the comparison stimulus conditions.

Inter-stimulus intervals: These were 2.5, 5.0, 10.0 and 20.0 seconds.
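
The logarithmic spacing of the comparison stimuli can be illustrated with a short sketch. The step values used here, 0.05 log10 unit for the weights and 0.025 log10 unit for the durations, are inferred from the listed stimulus values and are not stated in the text; the function name `log_series` is hypothetical. With these assumed steps the weight series is reproduced after rounding to whole grams and the duration series from 2.25 to 4.00 seconds after rounding to hundredths; the listed pressure values agree with a 0.05 log10 step only approximately.

```python
def log_series(standard, log_step, n_below, n_above):
    """Stimulus values spaced equally in log10 units about a standard."""
    return [standard * 10 ** (log_step * k)
            for k in range(-n_below, n_above + 1)]


if __name__ == "__main__":
    # Assumed steps: 0.05 log10 unit (weights), 0.025 log10 unit (durations).
    print([round(x) for x in log_series(100, 0.05, 3, 2)])
    print([round(x, 2) for x in log_series(3, 0.025, 5, 5)])
```

Equal steps on a logarithmic scale mean that each comparison stimulus differs from its neighbor by a constant ratio rather than a constant amount, so that one "step" is a comparable change anywhere along the series.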

(4) Procedure: Each subject came to the laboratory individually.
He sat at a table where he could perform all three tasks. The subject was
shown the 100 gram standard, instructed in lifting it for 3 seconds, and
allowed to lift it once for practice and for familiarizing himself with its
weight. He was told that our purpose was to see how well he could judge
the difference in weight between two flasks when the second was not available
until ___ seconds after the first had been put down. The trial was
run with the subject lifting on the word "LIFT" and lowering the weights
on the word "DOWN." He judged the second weight as "Heavier" or "Lighter"
than the first. He was then introduced to the duration task and allowed
to listen once to the standard so that he would know what general length
of sound to expect. He was told what the delay would be between the two
buzzes when the trial was conducted, and then the trial was run with judgments
of "Longer" or "Shorter" being the only ones admissible. For the
pressure task the routine was similar, the subject being allowed to depress
the key to feel the standard once before the comparison trial began. In
the trial itself, he depressed the key on "Down" and released it slowly on
"Up." Again he was allowed only judgments of "Heavier" or "Lighter." Three
different delays were used for the three tasks for each subject. The
order of delays was balanced across subjects so that at the end of the
experiment 45 subjects had made each judgment with each delay.
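
The arithmetic of the balancing can be checked with a small sketch. With 180 subjects, 3 tasks and 4 delays there are 540 judgments spread over 12 task-by-delay cells, or 45 per cell. The cyclic assignment below is only one hypothetical scheme that satisfies the stated constraints; the report does not say which scheme was actually used, and the names `assign_delays` and `cell_counts` are illustrative.

```python
from collections import Counter

DELAYS = [2.5, 5.0, 10.0, 20.0]            # seconds
TASKS = ["weight", "duration", "pressure"]  # in the order given to subjects


def assign_delays(n_subjects=180):
    """Give each subject three different delays, one per task (cyclic plan)."""
    plan = []
    for subj in range(n_subjects):
        start = subj % len(DELAYS)
        plan.append({task: DELAYS[(start + j) % len(DELAYS)]
                     for j, task in enumerate(TASKS)})
    return plan


def cell_counts(plan):
    """Number of subjects judging each task at each delay."""
    return Counter((task, delay) for subj in plan
                   for task, delay in subj.items())


if __name__ == "__main__":
    for cell, n in sorted(cell_counts(assign_delays()).items()):
        print(cell, n)
```

Each task-by-delay cell then supplies one up-and-down series of 45 single-trial subjects.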

(5) Subjects: The subjects were 180 undergraduates, men and
women, recruited from classes in elementary and experimental psychology.

(6) Results: The data are summarized in Figure 4 and Table 5.

For the lifted weights discrimination, the values of s for the
four delay conditions were not significantly different when evaluated by
Hartley's F test. The weighted average value of s, s̄, was 1.16 steps,
implying that the step size, d, was not too badly chosen for this task,
being about 0.9 s̄. (It had been our objective to obtain steps approximately
equal to 1σ for all three discrimination tasks.) s_m computed from
s̄ proved to be .25 steps. The range of the four values of m, i.e. the
time error, for the different delay conditions was tested against s_m (.05
level q-test). This test led to rejection of the hypothesis that the
inter-stimulus interval was without effect upon the time error. Based on the
amount of data here collected, the time errors for the 2.5 and for the 5
second inter-stimulus intervals were not significantly different from
zero (.05 level tests), while those for 10 and 20 seconds were. The over-all
pattern of the results plotted in the figure, however, is consistent
with classical results: the time error for short inter-stimulus intervals,
say below 3 seconds, appears to be positive for naive subjects.

[Figure 4: PSE and time error as a function of inter-stimulus interval for
three discrimination tasks. Abscissa: inter-stimulus interval in seconds;
ordinates: time error in grams.]

Table 5

DATA FOR EXPERIMENT T-1: TIME ERROR AS A FUNCTION OF INTER-STIMULUS INTERVAL.

                                           Inter-stimulus interval
                                     2.5 secs   5.0 secs   10 secs   20 secs

WEIGHT DISCRIMINATION (100 gm standard)
  PSE in steps re standard            +0.09      -0.23      -0.73     -1.00
  PSE in grams                        101.0       97.4       91.9      89.0
  Time error in grams                  +1.0       -2.6       -8.1     -11.0
  s in steps                           1.32       0.81       1.07      1.34
  s_m in steps (based on s̄)           0.25

PRESSURE DISCRIMINATION (50 gm standard)
  PSE in steps re standard            +0.40      -0.07      -0.93     -1.74
  PSE in grams                         52.4       49.6       44.8      40.9
  Time error in grams                  +2.4       -0.4       -5.2      -9.1
  s in steps                            .80       2.14       1.22      1.42
  s_m in steps (conservative)          0.38

DURATION DISCRIMINATION (3 secs. standard)
  PSE in steps re standard            -0.41      -0.69      +0.14     -1.05
  PSE in secs.                         2.93       2.88       3.02      2.82
  Time error in secs.                  -.07       -.12       +.02      -.18
  s in steps                           3.27       1.68       1.31      2.95
  s_m in steps (conservative)          0.56

For the pressure discrimination, the function obtained agreed well
with the pattern of the function obtained for the weight discrimination.
This agreement should not be too surprising in view of the kinesthetic
similarity of the two tasks. In the case of pressure, the values of s for
the up-and-down series for the four different time intervals were significantly
different (.05 level, F_max test). The average step size was
smaller than desired, being about 0.7 s. The most conservative estimate
of the value of s_m, based on the largest obtained value of s, was .38
steps. In terms of this value, the four obtained time errors differed
significantly (.05 level, q-test) and the errors at 10 and at 20 seconds
were significantly different from zero. Note that the time errors were
relatively larger for the pressure discrimination than for the weight
discrimination. Nevertheless, the influence of inter-stimulus interval on
time errors for the two discriminations was very similar.

For the duration discrimination, the results were far less neat
and clear. We believe this to have been a function of the fact that the
step size, which had been chosen on the basis of pilot work with experienced
subjects, proved to be for naive subjects considerably smaller than desired,
namely about 0.4 s. For such a step size, s has fairly low reliability.
Analysis of the data indicates that the values of s obtained for the four
different delay conditions were significantly different, while the values
of m (the time errors) for the four conditions were not significantly
different (conservatively tested using largest obtained s as basis for error
estimate). The over-all time error, averaged over the four inter-stimulus
conditions, was about 0.5 steps, and the data do not justify concluding
that this was significantly different from zero. The obtained error was
negative, however, and this does agree in direction with results obtained
by Stott (1935), and by Woodrow and Stott (1936) for duration judgments
when the standard is of 3 seconds duration.
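
The conversion from a PSE expressed in steps to a PSE and time error in stimulus units can be sketched as follows. The sketch assumes that one step equals 0.05 log10 unit for the weights and 0.025 log10 unit for the durations; these values are inferred from the stimulus series listed earlier and are not stated explicitly. The function names are hypothetical.

```python
def pse_from_steps(standard, m_steps, log_step):
    """PSE in stimulus units, given m in steps relative to the standard."""
    return standard * 10 ** (log_step * m_steps)


def time_error(standard, m_steps, log_step):
    """Time error in stimulus units: PSE minus standard."""
    return pse_from_steps(standard, m_steps, log_step) - standard


if __name__ == "__main__":
    # Values of m in steps taken from Table 5 (weights, then durations).
    for m in (0.09, -0.23, -0.73, -1.00):
        print(m, round(pse_from_steps(100, m, 0.05), 1),
              round(time_error(100, m, 0.05), 1))
    for m in (-0.41, -0.69, 0.14, -1.05):
        print(m, round(pse_from_steps(3, m, 0.025), 2))
```

Under these assumptions the conversion recovers the tabled PSE values to within rounding, which is what makes it possible to state the time error directly in grams or seconds.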

(7) Discussion. Two methodological comments are in order.
First: it is of interest to note that since the testing of any one subject
in the present experiment did not require an extended series of trials,
it was not inconvenient to use inter-stimulus intervals as long as 20
seconds. Most studies in the literature had stopped short of this long a
delay interval. Second: the duration portion of the study points to the
importance of having the size of stimulus step within the desired range,
as discussed in Section 10 above. Were the duration tests being conducted
today, we would benefit by the sampling data presented in Table 4, page
30 above. The up-and-down series for the first 15 subjects had already
covered 5 stimulus levels in the case of the 2.5 second delay condition,
5 for the 5 second condition, 5 for the 10 second, and 6 for the 20 second
condition. With Table 4, we would have been warned that we were using
a step size of the order of 0.5σ and would have revised our stimulus series.

Three content results with regard to the time error are also
worthy of comment. First: the time error for completely naive subjects,
making one judgment only and hence not subject to context effects arising
within the experiment itself, changes for both weight discrimination and
pressure discrimination as a function of inter-stimulus interval. The
nature of this function is much as Kohler described it on the basis of his
early experiments. Second: the magnitude of the time error, which was
readily quantified using the up-and-down method, became as large as 11
grams when the 100 gram standard weight had been lifted, and as large as
9 grams when the 50 gram pressure had been experienced. Both of these
extreme errors were observed with the 20 second inter-stimulus interval, and
it remains a possibility that the error would grow still more if the
inter-stimulus interval were extended. Third: there was no tendency for
variability of judgment (s) to increase with inter-stimulus interval, as might
have been expected from a variety of points of view.

b. Three experiments to map taste isohedons in the rat.

(1) Background. Some three years ago, P. T. Young became interested
in the problem of identifying incentive solutions which were equally
acceptable to the rat. Such solutions he called isohedonic, following the
lead of Guilford (1954) who had used the term with respect to auditory
stimuli which the human subject found equally pleasant. On the basis of
our project work with the up-and-down method, we proposed that an effective
way to locate a solution mixture which was isohedonic with a given standard
solution (or mixture) would be to use a group of 25 to 30 animals and
follow the up-and-down, one-trial-per-subject procedure.

(2) Method. Each animal in the group was given a brief, usually
3-minute, preference test in which two solutions were available: a simple
sucrose solution which was the standard, and a comparison mixture containing,
say, quinine and sucrose. If the first animal licked more of the
standard than of the comparison, the next animal was tested with a comparison
containing "one step more" sucrose. But if the first animal licked
more of the comparison, the second animal was tested with a less palatable
comparison, i.e. one containing "one step less" sucrose. Similar moves
were made after each other animal was tested. The animals in the experimental
group were tested in a random order, their responses providing a
group up-and-down series which hunted about that sucrose level which made
the comparison isohedonic with the standard. On each new test day, a
different measure was taken, with a different concentration of sucrose
in the standard and/or with a different concentration of quinine in the
comparison mixture. Each day the animals were tested in a new random
order, each one again being used for but one trial on the up-and-down
series.

(3) Results. This procedure has been followed in a series of
three experiments. Christensen (1962) has worked with sucrose vs. sucrose-salt
mixtures. Kappauf, Burright and DeMarco (in press, 1963) have worked
with sucrose vs. sucrose-quinine mixtures, and Young and Schulte (in
preparation) have worked with sucrose vs. sucrose-acid mixtures. In each
experiment it has been possible to locate a series of mixtures all isohedonic
with the same standard and thus map complete isohedonic contours or
isohedons. The one-brief-trial-per-subject procedure makes it quite certain
that these isohedons are descriptive of the animals' taste preferences
and not a function of post-ingestional factors which, under other test
conditions, might have influenced the animals' choice behavior.

c. An empirical check on s_m as a predictor of variability in m in
these experiments. In each of the foregoing animal experiments there were
some measurements which were repeated or replicated on a later test day.
The size of the differences in m from the first occasion to the second,
or more specifically the root-mean-square of these differences, may be
estimated from s_(m1-m2), as computed from typical values of s_m and s.
Table 6 provides a comparison of this measure of expected variability in
m for each of the experiments, with the variabilities actually obtained.
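
The comparison of expected and obtained variability can be sketched numerically. The sketch assumes that the two occasions give independent estimates of m, so that the standard error of their difference is the root of the sum of the two squared standard errors (and √2 times s_m when the two are equal); that this is the exact expression used for Table 6 is an assumption. The replication differences in the example are hypothetical numbers, not data from the experiments.

```python
import math


def se_of_difference(s_m1, s_m2=None):
    """Standard error of m1 - m2 for two independent estimates of m."""
    if s_m2 is None:
        s_m2 = s_m1            # equal standard errors on both occasions
    return math.hypot(s_m1, s_m2)


def rms(differences):
    """Root-mean-square of a set of replication differences."""
    return math.sqrt(sum(x * x for x in differences) / len(differences))


if __name__ == "__main__":
    expected = se_of_difference(0.25)      # s_m = .25 steps, as for the weights
    observed = rms([0.2, -0.3, 0.1, 0.4])  # hypothetical differences in steps
    print(expected, observed)
```

If the observed root-mean-square difference falls consistently below the expected value, the error formula is erring on the conservative side.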

Included in the table, along with the animal data, are similar computations
based upon records for the weight discrimination part of the time
error study. Here there were no formal replications, but we report the
outcome of dividing the series of 45 observations for each delay condition
into two portions, the first 20 trials and the last 25 trials. Values of
m were computed for each of these "halves," and then the first-half-second-half
difference in the value of m was determined for each of the four time
delays. Values of s for the 8 half-series were homogeneous and were averaged
to provide s̄.

(Note: N, to be distinguished from N_tot, is the number of observations of
the less frequent response, i.e. the N used in the formulae for m, s and s_m.)

It should be noted that the values of s̄ given in the table are simple
averages, and hence are smaller than the values which would have been
obtained if values of s^2 had been averaged and the root taken. In the
case of each of the last two studies cited in the table, two estimates
of s̄ are given: one based upon the variability of all series run in the
experiment, and the other based only upon the variability of those
up-and-down series involved in the replications.
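
The remark about simple averages can be checked with a few lines of arithmetic. The sketch below compares the simple average of several values of s with the root of their averaged squares; the values are hypothetical and the function names are ours.

```python
import math

def simple_average(values):
    """Arithmetic mean of the values of s."""
    return sum(values) / len(values)

def root_mean_square(values):
    """Square root of the average of the squared values of s."""
    return math.sqrt(sum(v * v for v in values) / len(values))

s_values = [1.2, 1.5, 1.9, 2.4]   # hypothetical values of s, in steps
print(simple_average(s_values), root_mean_square(s_values))
```

The two measures coincide only when all values of s are equal; otherwise the simple average is the smaller of the two.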

Examination of the second and third columns from the right in Table
6 indicates that the obtained replication differences were closely
approximated by the error formula. But it will be seen that for each of
these four studies, the value of s_(m1-m2) was larger than the observed
root-mean-square change in m. We infer that the variability of m, as
estimated from s, is over-estimated. This result is consistent with a
conclusion drawn by Brownlee et al. (1953) from their analysis of the
up-and-down method with small samples. The variance estimation formula,
s_m^2, is based on asymptotic theory, and the validity of testing procedures
using s_m for finite samples went uninvestigated during early work with the
method (Anon., 1944). What Brownlee et al. subsequently observed was that
s_m^2 may provide a conservative estimate of the accuracy of m for small
samples when up-and-down series start close to μ, say within 2 testing
levels of μ. Such "close starts" did characterize the up-and-down series
in the present experiments.

We thus have good reason to believe that for many, perhaps most,
applications of the up-and-down method in psychological research s_m will be
a conservative estimate of the accuracy of m when the one-trial-per-subject
procedure is employed. This conservative feature of s_m and the adequacy
of tests based on s_m are clearly deserving of continued study. (Please
note in this connection that the tests cited on pages 35 and 38 above were
conducted taking s_m at its face value.)

d. Summary comments. For experiments where group average performance
is to be evaluated, and where readying time per subject is suitably brief,
up-and-down testing following the one-trial-per-subject procedure may
frequently prove useful. One-trial-per-subject experiments have seen limited
application in the past, but they have been used (e.g. Stevens, 1956) and
should be used when it is of importance to eliminate the effect of the
subject's having had other recent experience in the test situation.

The estimate, s_m, obtained from the up-and-down series appears to
provide a satisfactory, though somewhat conservative, basis for estimating
the reliability of m for simple replications with the one-trial-per-subject
design.

14.  UP-AND-DOWN  TESTING  WITH  CONCURRENT  SERIES  RUN  UNDER  THE  SAME  TEST 
CONDITIONS  AND  WITH  THE  SAME  SUBJECT:  THE  VARIABILITY  OF  m  FOR 
CONCURRENT  SERIES. 

When several concurrent up-and-down series are run using the same
standard, the same test conditions and always the same subject, the values
of m obtained for these separate series each estimate the same parameter,
μ, the subject's PSE. Variability among these estimates will be influenced
by a number of factors: first, by σ, G, and N_c, as we know from the
formula for s_m; second, by differences between the initial testing levels
used for the different series, in the event that some or all initial
levels are far enough from μ to bias m; and third, by whatever serial
dependencies may exist among the judgments or responses of the session-long
program of trials.

a. Effect of differences in initial testing level on the values of
m for concurrent series. We have already discussed the general problem
of bias in m associated with the starting level of an up-and-down series,
and have considered ways of minimizing this bias (see pages 20-22 above).
From that discussion we recognize that if our several concurrent series
(or the portions used in computing values of m) all start near μ, the bias
will be small for each, and so the influence which differences in starting
level will have upon the values of m for the different series must also be
small. We anticipate, therefore, that concurrent series may be managed so
that the variability of m will not be enhanced by differences in initial
testing level.

In  connection  with  the  analysis  of  the  data  of  two  loudness  dis¬ 
crimination  experiments  (L-15  and  L-16  which  will  be  described  below) , 
McDiarmid  (1962a)  had  occasion  to  pull  out  a  sum  of  squares  associated 
with  initial  testing  level.  In  these  experiments,  there  were  many 
cases  where  two  relatively  short  concurrent  series,  with  fewer  than  25 
trials  per  series,  were  run  with  the  same  standard.  These  series  started 
2  stimulus  levels  apart,  at  levels  which  were  the  same  for  all  subjects 
in  a  particular  test  group.  PSE's  were  different  for  different  subjects, 
of  course,  but  the  average  distance  of  the  PSE  from  the  more  distant 
starting  level  was  only  some  2  or  2-1/2  stimulus  steps.  These  conditions 
should  introduce  little  bias  in  the  values  of  m,  and  indeed  McDiarmid 
could  find  in  his  analysis  no  basis  for  rejecting  the  hypothesis  that 
initial  testing  level  had  been  without  effect  upon  m. 

b. Dependencies between successive observations. Some exceptionally
long concurrent series were run in a portion of one of the foregoing
loudness experiments (L-16, Part I for Group R-1). The records of these
series permitted a cross-correlation study of dependencies between
successive observations in the test session (see McDiarmid, 1962a).

This  part  of  the  experiment  involved  four  concurrent  series.  Trials 
were  run  taking  the  four  series  in  simple  rotation:  one  trial  from  series 
#1  for  the  less  intense  standard,  one  from  series  #1  for  the  more  intense 
standard,  one  from  series  #2  for  the  less  intense  standard,  one  from 
series  #2  for  the  more  intense  standard,  etc.  The  number  of  trials  per 
series  was  75.  Context  effects  had  stabilized  by  the  end  of  25  trials 
per series, so the last 50 trials per series were appropriate for
cross-correlational analysis. Because of the influence of range on r, it
seemed desirable that this analysis be carried out with the more variable
of the available series, i.e. with series where the chosen step size had
been relatively smaller. This directed our attention to the two series
for the less intense standard.

Of the final 200 trials in the test session, trials 101, 105, 109,
. . . 297 belonged to one series with the less intense standard, and
trials 103, 107, 111, . . . 299 belonged to the other. For purposes of
computing a cross-correlation between the stimulus levels used in these
two series, the paired trials might be taken as 101 and 103, 105 and 107,
etc., or they might be taken as 103 and 105, 107 and 109, etc. It is clear
that there is no reason to prefer one of these pairings to the other,
and that each pairing produces a biased value of r depending upon the
phase relationships between the two up-and-down series. (This bias
reaches its limit when the step size is very large and μ is between two
levels. Then each series oscillates back and forth between two test
levels, one on either side of μ. Under these circumstances, one pairing
of the trials produces an r of +1.00 and the other an r of -1.00.) We
therefore took the average of the correlations found for the two
different cross-pairings of trials as our measure of the cross-correlation
of the two concurrent up-and-down series. The data for the 8 subjects
who served in the group under consideration are given in Table 7.
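
The two-phasing average can be sketched as follows. The function names are ours, and the example data are an artificial limiting case (two series each oscillating between two levels) rather than records from the experiment.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def phased_cross_correlations(series_a, series_b):
    """Correlate stimulus levels of two interleaved series under both pairings."""
    # Phasing 1: each trial of A is paired with the B trial that follows it.
    r_one = pearson_r(series_a, series_b)
    # Phasing 2: each trial of B is paired with the A trial that follows it.
    r_other = pearson_r(series_b[:-1], series_a[1:])
    return r_one, r_other, (r_one + r_other) / 2

# Limiting case: both series alternate between two levels straddling the PSE.
oscillating = [0, 1] * 25
r_one, r_other, r_average = phased_cross_correlations(oscillating, oscillating)
print(r_one, r_other, r_average)
```

In this limiting case the two phasings give correlations of opposite sign and equal size, so the average removes the phase artifact entirely.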

As  expected  from  its  nature,  r  varied  with  the  size  of  s.  The  up- 
and-down  series  for  the  most  variable  subject  drifted  considerably  during 
the  trials  under  study,  and  the  correlation  measure  for  him  shows  that 
indeed  the  two  series  generally  moved  up  and  down  together.  Some  of 
the  other  correlations  are  small,  but  it  will  be  noted  that  seven  of  the 
eight  average  cross-correlations  are  positive. 

The  tendency  found  here  for  concurrent  up-and-down  series  on  the 
same  standard  to  drift  up  and  down  together  adds  to  accumulating  data 
in the literature which point to trial-to-trial serial dependencies. It
is  noteworthy  here  that  the  present  dependencies  were  observed  at  a  lag 
of  2  trials,  one  trial  with  the  more  intense  standard  having  intervened 
between  the  members  of  every  pair  of  trials  which  entered  into  the  cross- 
correlations.  Presumably  the  cross-correlations  for  two  concurrent  series 
taken  in  simple  alternation  on  successive  trials  would  be  greater  than 
those  cited  in  Table  7 . 

c.  The  variability  of  values  of  m  for  concurrent  series  run  under 
the  same  test  conditions.  From  the  foregoing  correlations  we  may  presume 
that  fluctuations  of  the  subject's  criterion  of  equality  or  changes  in 
his  response  habits  occur  during  the  test  session.  Such  changes  must 
have  the  effect,  as  suggested  on  page  32  above,  of  raising  the  variability 
observed  within  the  single  up-and-down  series,  while  at  the  same  time 
stabilizing  to  a  certain  extent  values  of  m  for  the  different  series. 

This then is a factor which will cause the values of m for concurrent
series run under the same conditions to vary less than we would expect
on the basis of obtained values of s and s_m.

Table 7

CROSS-CORRELATIONS BETWEEN CONCURRENT UP-AND-DOWN SERIES
FOR 8 SUBJECTS IN A LOUDNESS DISCRIMINATION EXPERIMENT

(Data for Group R-1, Experiment L-16, Part I, 53 db. standard)

          Average value                       Cross-correlations
          of s for 2      Step size    for one      for other
          concurrent      in units     phasing      phasing
Subject   series: s̄       of s̄         of trials    of trials    Average

   1      5.10 steps       .20s̄         +.53         +.54         +.53
   2      0.97 steps      1.03s̄         +.13         -.10         +.01
   3      1.86 steps       .54s̄         +.07         +.05         +.06
   4      1.66 steps       .60s̄         +.09         +.21         +.15
   5      1.19 steps       .84s̄          .00         -.06         -.03
   6      1.94 steps       .52s̄         +.26         +.24         +.25
   7      1.44 steps       .69s̄         +.08         +.26         +.17
   8      1.12 steps       .89s̄         +.28         +.16         +.26

Average   1.91 steps                                              +.17

We have already seen, of course, that s_m is a conservative estimate of
the reliability of m in the one-trial-per-subject situation, so the
occurrence of positive cross-correlation between concurrent series means
that s_m should be an even more conservative estimate of the reliability
of m for concurrent series.

McDiarmid (1962a), continuing the analysis of the data which provided
the correlation measures in Table 7, made further calculations to compare
the average value of s for the 8 subjects in the group with an estimate of
σ based on the observed variation in the value of m from one concurrent
series to the other. In Table 8 we present a summary of his calculations,
recast in a form which compares expected and observed concurrent-series
differences in m. This table is thus a direct parallel of Table 6, page 41.
The comparison of interest here is between the last two columns at the right.
Again, as in Table 6, s_m leads to an overestimate of the root-mean-square
difference in the obtained values of m. For the less intense standard, as
compared with the more intense standard, the average value of s was larger
(step size smaller), and it was for this reason that the correlations in
Table 7 were computed for the less intense standard. Cross-correlations
for the more intense standard were not calculated, but they must have been
less than those cited in Table 7. Were they close to zero, the observed
and expected root-mean-square concurrent-series difference in m should have
differed by an amount similar to the discrepancies found in Table 6. And
this is the case for that standard. For the less intense standard, however,
the discrepancy is greater than any reported in Table 6, a result associated
with the observed cross-correlations and interdependencies.

It would appear from these records that with step sizes in the range
thought to be desirable, namely 1.5 to 2.0σ, the effect of cross-correlation
on the variability of m will be negligible: that for such step sizes, the
variability of m for concurrent series will be much like the variability of
m for completely independent series. This matter deserves further checking,
but it seems clear at the moment that the best opportunity for significant
cross-correlational effects on the stability of m will occur when a small
step size is chosen.

d. Implications for statistical tests of values of m. Until further
information is forthcoming on the merits of various test procedures based
upon s_m for short up-and-down series, we have two ways in which we may proceed
in conducting significance tests concerning values of m. One is to use the
formulae and procedures based on s_m, appreciating that our evaluation of
differences in m may be conservative. The other is to use concurrent-series
variability in the values of m as our error measure for m. Thus, for example,
in a single-session experiment where two (or more) concurrent series are
run under each of several test conditions, differences between conditions
may be evaluated by analysis of variance procedures applied to the obtained
values of m, disregarding completely the variability of the individual
series. The latter procedure has been used and discussed in detail by
McDiarmid (1962a).
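
A minimal sketch of the second approach follows: a one-way analysis of variance applied directly to values of m, with the disagreement between concurrent series under the same condition serving as the error term. This is a generic textbook computation and not a reproduction of McDiarmid's analysis; the function name and the data are hypothetical.

```python
def one_way_anova(groups):
    """Return (MS between conditions, MS within conditions, F ratio).
    Each group holds the values of m from the concurrent series
    run under one test condition."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (mu - grand_mean) ** 2
                     for g, mu in zip(groups, means))
    ss_within = sum((x - mu) ** 2
                    for g, mu in zip(groups, means) for x in g)
    ms_between = ss_between / (len(groups) - 1)
    ms_within = ss_within / (n_total - len(groups))
    return ms_between, ms_within, ms_between / ms_within

# Hypothetical values of m: three conditions, two concurrent series each.
m_values = [[10.0, 10.4], [11.0, 11.2], [12.1, 11.9]]
ms_conditions, ms_series, f_ratio = one_way_anova(m_values)
print(ms_conditions, ms_series, f_ratio)
```

With only two series per condition the error term has few degrees of freedom (3 here), so the test is insensitive unless several conditions or series are available.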

15. SOME UP-AND-DOWN DATA ON THE PROBLEM OF SESSION-TO-SESSION VARIABILITY
OF MEASURES ON A SINGLE SUBJECT: THE VARIABILITY OF m̄ FOR DIFFERENT
SESSIONS.

Preceding sections have discussed two questions related to the
reliability of estimates of μ obtained by the up-and-down method: the
variability of m across replications with the one-trial-per-subject design,
and the variability of m for concurrent series with the same subject. Still
another aspect of the reliability problem is that which concerns the
variability of estimates of μ obtained during different experimental sessions
with the same subject. Let us assume that the testing program for each
session involves the use of concurrent series. We compute m for each
up-and-down series, and find the average value of m (i.e. m̄) for each session.
How consistent are the values of m̄ for different sessions? How is the
variability of m̄ related to available measures of within-session variability?

a.  Source  of  data.  For  data  on  these  questions  we  have  examined  the 
records  of  two  experiments  which  will  be  discussed  in  more  detail  in  a 
section  soon  to  follow.  The  experiments  were  not  conducted  specifically  for 
purposes  of  looking  at  session-to-session  variability,  but  each  subject  in 
each  of  the  studies  was  run  twice  under  comparable  conditions  using  the 
up-and-down  method  with  three  concurrent  series.  In  the  one  study  on  dura¬ 
tion  discrimination,  up-and-down  testing  constituted  the  entire  experimental 
session  on  two  of  four  test  days.  In  the  other  study  on  stereoscopic  dis¬ 
crimination,  up-and-down  testing  constituted  the  opening  half  of  two  experi¬ 
mental  sessions.  So  for  all  subjects  we  had,  and  report  here  in  Table  9, 
information  on  the  variability  of  m  from  one  session  to  a  second,  where 
conditions  were  the  same  in  both  sessions. 

b. Analysis. The nature of our analysis becomes clear in terms of our
entries in Table 9:

Column 2 lists the average number of trials, n, on which were
based the computations of m and s for the individual series.

Columns 3 and 4 present the average value of s for the six
up-and-down series for each subject, and the reciprocal of this, which
indicates the average step size for him in standard deviation units.

Column 5 gives the variance of m as estimated from s̄.

Column 6 presents a measure of variability introduced here for
the first time in these discussions and designated (by us) by the symbol
S. S was computed using the formula for s but taking as data the combined
frequency distributions for three concurrent series. Thus we obtained but
one value of S per session, representing a mixture of within-series and
between-series variability for that session. S̄ is the average value of S
over the two sessions.

Column 7 lists the variance of m̄ as estimated from S̄, and
represented by the symbol S_m̄^2. Since each value of m̄ entails three times
as many observations as each value of m, S_m̄^2 may be as small as, but
cannot be smaller than, one-third the value of s_m^2.

Columns 8 and 9 provide the mean squares for concurrent series
and for sessions obtained from an analysis of variance of the six values
of m for each subject.

Finally, columns 10 through 13 give four variance ratios which
are of interest, with those values which are "significant" at the .05
level indicated by asterisks. Although s_m^2 is not a traditional variance
measure, tests equivalent to those in columns 10 and 12 have been suggested
as suitable (Anon., 1944).
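
The statement that S mixes within-series and between-series variability can be illustrated numerically. The sketch below uses the plain variance of testing-level indices as a stand-in for the quantity entering the formula for s (an assumption made only to keep the example short), with three hypothetical series of equal length.

```python
def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Variance with divisor N, as used for a frequency distribution."""
    centre = mean(values)
    return sum((v - centre) ** 2 for v in values) / len(values)

# Hypothetical level indices of the less frequent response in three series.
series = [[0, 1, 1, 2], [1, 2, 2, 3], [1, 1, 2, 2]]

within = mean([variance(s) for s in series])        # average within-series
between = variance([mean(s) for s in series])       # spread of series means
pooled = variance([v for s in series for v in s])   # combined distribution

print(within, between, pooled)
```

For series of equal length the pooled variance is exactly the sum of the other two terms, so the combined distribution can never show less spread than the average single series. This is the reason the variance of the session mean has a lower bound of one-third the single-series value.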

Looking first at column 10, we see that for 12 of the 20 subjects, m
for concurrent series was less variable than expected from s_m. This trend
was due entirely to differences for those subjects with the smaller step
sizes, and is in line with the discussion on page 47 above, to the effect
that the likelihood of s_m overestimating the variability of m varies with
step size. Unexpected was the finding that for two subjects there was
"significantly" greater variability between values of m for concurrent
series than expected from s_m. These two cases can only be ascribed to
sampling error.

With regard to the variability of m̄ from session to session, we see
high variability for 4 of the duration subjects and 5 of the stereo subjects.
The test in column 11 based on the analysis of variance is of relatively
low sensitivity because of the small number of degrees of freedom involved
in the error term, but in general we have agreement with the tests given
in columns 12 and 13.

c. Discussion. Of interest is the extremely large size of many of the
variance ratios for the session effect. These values imply very marked
changes in μ between sessions. We may conclude that, at least for some
subjects, marked drifts in μ or changes in the criterion of judgment occur
from session to session. Whether such drifts are characteristic of all
or most subjects will become more apparent when more critical experiments
are run extending over more than two sessions. For such experiments, it
appears that any of the three different variance ratios used in Table 9
to test session effects should be adequate.

In general the psychophysical literature has not given much attention
to session-to-session change in the PSE. That such change might occur is
of course not unexpected in terms of the occurrence of drifts within
sessions. Perhaps the relative ease of evaluating between-session effects
by the up-and-down method will lead to further investigation of this
session problem.

16.  AN  EXPERIMENTAL  COMPARISON  OF  THE  UP-AND-DOWN  METHOD  AND  THE  METHOD  OF 

CONSTANT  STIMULI:  RESULTS  OF  TWO  EXPERIMENTS. 

a. Background. In the introduction to this report it was pointed out
that the method of constant stimuli is a method in which session-wide
stimulus context is controlled by the experimenter, and in which the judgment
distribution or judgment context is a function of the subject's responses
to those stimuli. In the up-and-down method, on the other hand,
session-wide judgment context is controlled by the nature of the sequential
program of trials, and the stimulus distribution or stimulus context is a
function of the subject's discriminative behavior. Both methods are used to
estimate μ and σ of the psychometric function. From the estimate of μ, we
quantify the subject's bias, as (PSE-POE) or (m-POE), and in the estimate of
σ we have a direct measure of the subject's differential sensitivity.

One of the early tasks of our project was to compare estimates of μ and
σ obtained by these two methods. This comparison was motivated by the
expectation, shared with other experimenters, that when a moderate to large
judgment bias exists, attempts to measure it by the method of constant stimuli
will typically result in measures which are too small. This expectation is
based on the premise that the PSE obtained by this method will be constrained
near the POE when the comparison stimuli are symmetrically distributed about
the POE. There are two arguments for this view: (a) The PSE tends toward
the center of the comparison series when the latter is asymmetrically located
with reference to the POE (see, for example, Harris, 1948). Surely if this
is so, it is reasonable to suppose that the PSE is also influenced by
stimulus context when a symmetrically located series of comparison stimuli is
used. This would keep the PSE near the POE, and would mean a depressed
measure of bias. (b) Subjects in psychophysical experiments frequently
appear to be set or disposed to use the two opposed responses, "heavier" vs.
"lighter", "louder" vs. "softer", etc., equally often. Such a set favors
the PSE being near the POE if the comparison stimuli fall symmetrically
about the POE.

The up-and-down method, by contrast, hunts for the PSE. It cannot
constrain the PSE because it imposes no stimulus context upon the subject.
Rather it allows the subject to have "whatever comparison stimuli he wants"
in order to locate the PSE. And a second feature of the up-and-down method
which is equally interesting is that while it does impose a judgment context
on the subject, that judgment context is the 50-50 one which he appears to
expect anyway.

These considerations lead to the hypothesis that a subject who has a
judgment bias with regard to a given differential discrimination will
evidence a larger bias when tested by the up-and-down method than when tested
by the method of constant stimuli. It may also be conjectured that such
constraint as the method of constant stimuli may impose on the PSE
will make PSE's determined by this method more uniform or stable from day to
day or from subject to subject than PSE's determined by the up-and-down method.

b. Experiment CS-1: A comparison of the method of constant stimuli and
the up-and-down method in a study of the discrimination of auditory durations.

(1) Specific purpose. Previous work by Stott (1935), Woodrow and
Stott (1936), and others has shown that the time error for very short
durations is positive, while that for longer durations is negative. The
"indifference" duration, or duration where the transition occurs between
positive time errors and negative time errors, is estimated to be in the
range of 1 to 2 seconds. From the arguments presented above, we would
expect the up-and-down method to reveal a larger positive time error than
the method of constant stimuli when the duration being judged is short,
say of the order of 0.5 seconds, and a larger negative time error when the
duration being judged is long, say of the order of 5.0 seconds. In other
words, the function relating time error to duration should be clearer or
steeper for the up-and-down method than for the method of constant stimuli.
Further, if the method of constant stimuli does place constraints upon the
PSE in terms of context effects, then differences between the PSE's for
different individuals should be less by the method of constant stimuli than
by the up-and-down method.

Our purpose then was to compare the two methods with regard to time
error magnitudes and individual differences in time errors.

(2) Apparatus and general procedure. The auditory stimulus used
in this experiment was white noise. It was of moderate intensity, keyed by
an electronic switch with 50 msec. rise time, and presented by a loudspeaker
located some six feet from the subject. The subject sat alone in a small
experimental room and indicated his judgments by throwing one of two
spring-loaded lever-type switches. He pushed one key if he judged the second
sound to be longer in duration than the first, the other key if he judged it
to be shorter than the first.

There were 31 stimulus durations available, each differing from
its neighbors in the series by 1/24th of a log unit. The series ran:

.38 secs., .41, .45, _.50_, .55, .61, .67, ... 1.19, 1.31, 1.44, _1.58_, 1.74,
1.92, ... 3.75, 4.13, 4.54, _5.00_, 5.50, 6.06, and 6.67 secs.
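
The spacing of the series can be reproduced with a short computation. The sketch below assumes the series is anchored at the .50-second standard and runs from 3 steps below it to 27 steps above it; these anchoring choices and the function name are ours, picked because they yield 31 values that agree with the listed durations to within rounding.

```python
def duration_series(anchor=0.50, steps_below=3, steps_above=27,
                    steps_per_log_unit=24):
    """Durations in seconds, spaced 1/24 of a log10 unit apart."""
    return [anchor * 10 ** (k / steps_per_log_unit)
            for k in range(-steps_below, steps_above + 1)]

series = duration_series()
print(len(series))
print([round(d, 2) for d in series])
```

Under this anchoring the three standards fall at positions 4, 16 and 28 of the series, that is, exactly 12 steps (half a log unit) apart.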

Trials  for  both  the  method  of  constant  stimuli  and  the  up-and-down 
method  were  presented  using  the  automatic  programmer  described  in  the  appen¬ 
dix  to  this  report.  This  equipment  was  located  in  a  room  adjacent  to  the 
subject's  room.  Both  the  stimulus  sequence  and  the  subject's  responses  were 
recorded  on  a  multi-channel  unit  using  electro-sensitive  recording  paper. 

The subject was alerted for each trial by a small panel light which
came on as a warning signal 2 seconds before the standard. The standard was
a sound of either .50, 1.58 or 5.00 seconds duration, the three values
underlined in the above series. The inter-stimulus interval, following the
standard and preceding the comparison, was always 5.00 seconds. The
interval between trials, from the end of the comparison to the next warning
signal, was 9.75 seconds. Maximum times per trial were thus of the order of
25 seconds.

The subject made 90 judgments per day on four consecutive
experimental days. After every 15 trials he had a two-minute rest. All sessions
were completed in less than an hour. First day sessions were the longest in
that they included instructions and a brief series of practice trials.
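
The timing figures given above can be combined into a rough session budget. The sketch below assumes five rest periods (none after the final block of 15 trials) and ignores instructions and practice trials; the function names and the choice of the longest constant-stimuli comparison for the 5.00-second standard (6.06 seconds) are ours.

```python
WARNING_LEAD = 2.0       # seconds from warning light to standard
INTER_STIMULUS = 5.0     # seconds between standard and comparison
INTER_TRIAL = 9.75       # seconds from end of comparison to next warning

def trial_seconds(standard, comparison):
    """Duration of one complete trial cycle, in seconds."""
    return WARNING_LEAD + standard + INTER_STIMULUS + comparison + INTER_TRIAL

def session_minutes(standard, comparison, trials=90, block=15,
                    rest_minutes=2.0):
    """Upper estimate of session length if every trial used this comparison."""
    rests = trials // block - 1          # assumed: no rest after last block
    return trials * trial_seconds(standard, comparison) / 60 + rests * rest_minutes

print(trial_seconds(5.00, 6.06))
print(session_minutes(5.00, 6.06))
```

Even in this worst case for the constant-stimuli sessions the estimate stays under one hour, in keeping with the session lengths reported.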

(3) Subjects. The subjects were 12 young men, either high school
seniors or college undergraduates attending the 1959 Summer Session. All
were paid for participating in the experiment, most in fact having been
recruited through the Student Employment Service.

(4) Design. Each subject was tested with but one standard, making
the design a random-groups design with four subjects per group. Two of the
subjects in each group were tested using the method of constant stimuli on
days 1 and 3, the up-and-down method on days 2 and 4, while the other
subjects were tested using the methods in the opposite order.

(5) Comparison series for the method of constant stimuli. Five
comparison durations were used with each standard. One of these was equal to
the standard. The others were the two durations immediately shorter
than and the two immediately longer than the standard. Thus each comparison
series was symmetrically distributed about the standard on the logarithmic
scale of duration. The order in which the several comparison stimuli were
used on successive trials was random, subject to the two restrictions that
the same comparison never be used on two successive trials, and that each
comparison occur three times in each set of 15 trials.
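
One way such a restricted random order might be generated is by repeated shuffling until both restrictions are met. This is a sketch only: the original orders were prepared for the automatic programmer by means not described here, the function names are ours, and the choice to apply the no-repeat rule across block boundaries is an assumption.

```python
import random

def block_order(n_stimuli=5, repeats=3, previous=None, rng=random):
    """Shuffle one block until no comparison occurs on two successive trials."""
    while True:
        order = [i for i in range(n_stimuli) for _ in range(repeats)]
        rng.shuffle(order)
        if previous is not None and order[0] == previous:
            continue                      # would repeat across the block boundary
        if all(a != b for a, b in zip(order, order[1:])):
            return order

def session_order(n_blocks=6, seed=None):
    """Six blocks of 15 trials; values 0-4 index the five comparison stimuli."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_blocks):
        previous = trials[-1] if trials else None
        trials.extend(block_order(previous=previous, rng=rng))
    return trials

order = session_order(seed=1)
print(order[:15])
```

Six such blocks give the 90 trials of a session and 18 presentations of each comparison stimulus.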

(6) Programming for the up-and-down method. Three concurrent
series with the same standard were used in the up-and-down sessions. Each
series was programmed using a different add-and-subtract stepper. The
starting level for each series was randomly chosen: either equal to the
standard, the standard minus one level, or the standard plus one level. The
sequence in which trials were taken from the three series was based on
random permutations of the three, subject to the restriction that no given
series be used on two successive trials. On the second experimental day with
the up-and-down method, each series was continued where it had left off at
the end of the first day.
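
The programming just described can be sketched as a short simulation. It is a sketch under stated assumptions only: the subject is replaced by a hypothetical normal psychometric function with parameters `mu` and `sigma` (in stimulus steps), the function name is invented, and a permutation is simply redrawn whenever it would place the same series on two successive trials.

```python
import random
from statistics import NormalDist

def run_concurrent_updown(n_trials=90, mu=0.5, sigma=1.0, seed=None):
    """Simulate three interleaved up-and-down series.
    Levels are integers in stimulus steps relative to the standard (0)."""
    rng = random.Random(seed)
    observer = NormalDist(mu, sigma)                    # hypothetical subject
    level = [rng.choice((-1, 0, 1)) for _ in range(3)]  # random starting levels
    records = [[] for _ in range(3)]                    # (level, said_longer) per series
    order, last = [], None
    while len(order) < n_trials:
        perm = rng.sample(range(3), 3)                  # random permutation of the series
        if perm[0] == last:                             # would repeat a series: redraw
            continue
        order.extend(perm)
        last = perm[-1]
    order = order[:n_trials]
    for k in order:
        said_longer = rng.random() < observer.cdf(level[k])
        records[k].append((level[k], said_longer))
        # "longer" moves that series one step down, "shorter" one step up
        level[k] += -1 if said_longer else 1
    return order, records
```

Each element of `records` is the trial-by-trial record of one series, which is the form of input needed for computing m and s for that series.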

(7) Determination of the PSE's and difference thresholds. Data
for each session under the method of constant stimuli were tallied in the
usual way, plotted on normal probability paper and a straight line fitted to
the points by eye. The latter operation was done independently by two
experimenters. The plotted probabilities were based upon 18 observations per
comparison stimulus for each experimental session. From each fitted line,
the value of the median and the standard deviation were read as estimates of
μ and σ of the psychometric function. The median and standard deviation for
each session are hereafter designated as mdn_VL and SD_VL, where the subscript
stands for "visual line." Their values are averages of the two experimenters'
estimates.

For the up-and-down method, m and s were computed for each series
separately on each experimental day. The values of m for the three series
were averaged to obtain m̄ for the session. Values of s were averaged to
obtain s̄. All calculations were based on all trials of the up-and-down series,
inasmuch as m was never very far from the initial testing level and biasing
effects of that testing level must have been very small.

As will be explained below, the data for all up-and-down sessions
were also subjected to graphic analysis, partly as a check that the
outcome of the experiment did not hinge on differences in analysis between
the up-and-down and constant methods, and partly as a check on the degree of
agreement of this analysis with that employing the Princeton formulae.
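
For readers who wish to reproduce values of m and s from a single series, the following sketch assumes that the "Princeton formulae" are the estimators of Dixon and Mood (1948); that identification is an assumption, since the formulae are not restated in this section. The function name and the record format (pairs of level and response) are hypothetical. The estimates use only the less frequent of the two responses, and the approximation for s is known to be poor when the spread term falls below about 0.3.

```python
from collections import Counter

def dixon_mood(record, step=1.0):
    """record: list of (level, said_longer) pairs from one up-and-down series,
    with levels on an equally spaced scale (spacing `step`).
    Returns (m, s) computed from the less frequent response."""
    n_longer = sum(1 for _, r in record if r)
    use_longer = n_longer <= len(record) - n_longer
    counts = Counter(lv for lv, r in record if r == use_longer)
    lowest = min(counts)
    n = sum(counts.values())
    a = sum(((lv - lowest) / step) * c for lv, c in counts.items())
    b = sum(((lv - lowest) / step) ** 2 * c for lv, c in counts.items())
    # "longer" sends the next comparison down, so it is the response that
    # takes the minus sign; "shorter" takes the plus sign.
    sign = -0.5 if use_longer else 0.5
    m = lowest + step * (a / n + sign)
    spread = (n * b - a * a) / n ** 2
    s = 1.620 * step * (spread + 0.029)  # approximation; poor when spread < 0.3
    return m, s
```

Averaging the three values of m, and likewise the three values of s, from the concurrent series of a session would then give the session values m̄ and s̄.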

The results of the study are summarized in Figures 5 and 6 and in
Table 10.

(8) Results on time error magnitudes. Figure 5 is a scatterplot
of bias measures, subject by subject, by the two psychophysical methods.
The data plotted are the two-session averages for each subject, as listed
in columns 3 and 7 of Table 10. Had bias measures been the same or similar
by the two methods, the points would have been fitted well by the 45-degree
line in the figure. As will be seen, 3 of 12 subjects had essentially no
bias as measured by either method (less than 0.1 stimulus step). For 1
of the remaining 9 subjects, a larger bias was found in data obtained by
the method of constant stimuli, for 1 the bias was equal by both methods,
and for 7 the measured bias was larger by the up-and-down method. It may
also be noted that of 7 bias measures which were 0.5 stimulus step or larger,
6 were obtained with the up-and-down method.

(9)  Results  on  time  errors  vs.  duration.  The  relation  of  the 
time  error  measurements  to  stimulus  duration  is  plotted  in  Figure  6.  The 
fact  that  the  function  is  steeper  for  data  obtained  by  the  up-and-down 
method  is  interpreted  to  mean  that  the  function  determined  by  the  method 
of  constant  stimuli  was  flattened  by  stimulus  context  factors  which  limit 
bias  measures  by  that  method. 

(10) Results on individual differences. The range of bias measures
in each group of subjects was clearly greater for the up-and-down method.
This can be seen in Figure 6, as well as in a comparison of the values listed
in columns 3 and 7 of the table.

(11) Results on session-to-session differences. Session-to-session
differences for the two methods are compared in columns 2 and 6 of the
table. For 6 of the 12 subjects, session-to-session differences were larger
by the method of constant stimuli and for 6 they were smaller.

Given the condition of fully independent observations, the
up-and-down method is known to be more efficient than the method of constant
stimuli, requiring some 30% fewer trials for the same precision of
measurement (Dixon and Mood, 1948). Thus, for the same number of trials the
standard error of m̄ should be less than the standard error of mdn_VL,
even if we forget that our random errors in visually estimating the line
of best fit must also increase the standard error of mdn_VL. Opposing this
trend and working to decrease variability in mdn_VL, however, is restriction
of bias as a result of stimulus context. We had supposed that this effect
might be strong enough that session-to-session differences in estimates of
μ would be even less by the method of constant stimuli than by the
up-and-down method. The data indicate no such extreme result, but bias
restriction was apparently sufficient that the method of constant stimuli did
not exceed the up-and-down method in session-to-session variability.

Figure 5: Relation between estimates of μ by the method of
constant stimuli and by the up-and-down method in the duration study.
The plotted points are for the 12 individual subjects. (Horizontal axis:
average bias as measured by the method of constant stimuli, average mdn_VL
in stimulus steps.)

Table 10

BIAS IN DIFFERENTIAL JUDGMENTS OF AUDITORY DURATION,
AS MEASURED BY THE UP-AND-DOWN METHOD AND BY THE METHOD OF CONSTANT STIMULI.

Note: In the present table values of m̄ and mdn_VL are given directly as bias
measures in stimulus steps. Estimates of σ are also given in stimulus steps.

Figure 6: Time error for auditory duration as a function of the
duration of the standard stimulus. Letters are subjects' initials.
Circled symbols = up-and-down measures. Uncircled symbols = constant
stimulus measures. (Horizontal axis: duration of the standard, in secs.)

(Our  interest  here,  of  course,  has  been  only  in  a  comparison  of 
session-to-session  variabilities  for  our  two  methods.  An  evaluation  of 
session-to-session  variability  of  the  single  subject  tested  with  concur¬ 
rent  series  during  each  session  has  appeared  above  in  section  15.) 

(12) Summary. As had been expected, larger time errors were
obtained in the discrimination of auditory durations when the up-and-down
method was used than when the method was that of constant stimuli.
Individual differences in judgment bias were also greater by the up-and-down
method. These results support the view that PSE's obtained by the method
of constant stimuli are constrained to be near the POE, making measures of
bias improperly small by that method.

Concerned in this experiment were Drucker, McDiarmid and the
author.

c. Analysis of up-and-down data for experiment CS-1 in terms of response
proportions. It is of interest as a control for the foregoing results,
and in a more general sense also, to examine the consequences of analyzing
up-and-down data by the same procedures used in processing data collected
by the method of constant stimuli. We have therefore compared the estimates
m̄ and s̄ with parallel estimates of μ and σ obtained by applying graphic
methods to response proportions calculated for the different comparison
stimulus levels used in the up-and-down sessions.

For this analysis, all responses for the three concurrent up-and-down
series for a given subject on a given day were combined into a single
distribution for that session. The proportion of "longer" judgments was
computed for each stimulus level and plotted on normal probability paper.
A visually determined line of "best" fit was drawn to the plotted points,
and estimates of μ and σ read from the line. These were compared with m̄
and s̄ computed for that session (i.e. with average m and average s for the
concurrent series). The record of 12 subjects, each tested on 2 days,
provided 24 sessions on which comparative data were assembled.
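
A numerical counterpart of this graphic analysis may be sketched as follows. It is an approximation under stated assumptions, not the procedure used here: an unweighted least-squares line on the z-scale stands in for the line drawn by eye on normal probability paper, interpolation for the median is done in z units, and the function names and input format are hypothetical. Proportions of exactly 0 or 1 are skipped because they have no finite position on the probability scale.

```python
from statistics import NormalDist

def _usable_points(trials_by_level):
    """(level, z) pairs for levels whose proportion is strictly between 0 and 1."""
    inv = NormalDist().inv_cdf
    return [(x, inv(k / n)) for x, (k, n) in sorted(trials_by_level.items())
            if 0 < k < n]

def probit_line_estimates(trials_by_level):
    """trials_by_level: {level: (n_longer, n_trials)} pooled over a session.
    Returns (mu, sigma) read from a straight line fitted to the z-scores."""
    pts = _usable_points(trials_by_level)
    if len(pts) < 2:
        raise ValueError("need at least two usable proportions")
    mx = sum(x for x, _ in pts) / len(pts)
    mz = sum(z for _, z in pts) / len(pts)
    slope = (sum((x - mx) * (z - mz) for x, z in pts)
             / sum((x - mx) ** 2 for x, _ in pts))
    # The line is z = (x - mu) / sigma: sigma = 1/slope, mu is where z = 0.
    return mx - mz / slope, 1.0 / slope

def interpolated_median(trials_by_level):
    """Interpolate between the two adjacent usable proportions that bracket
    .50; returns None when no such pair exists."""
    pts = _usable_points(trials_by_level)
    for (x0, z0), (x1, z1) in zip(pts, pts[1:]):
        if z0 <= 0 <= z1 and z1 > z0:
            return x0 + (x1 - x0) * (0 - z0) / (z1 - z0)
    return None
```

The sketch does not guard against a non-positive slope, which noisy proportions from very few trials could produce.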

Several features of an analysis of up-and-down data in terms of
response proportions may be anticipated. These include the following:

(1) Whatever the number of levels used in the up-and-down series,
obtained response proportions will typically be more reliable for stimulus
levels near the center of the range of levels, for it is the primary
feature of the up-and-down method that trials will be concentrated near the
.50 point of the psychometric function. This suggests, in fact, that the
two points bracketing the median might be taken as the principal basis
for forming the line of best fit. In the extreme, one might merely
interpolate between these two to locate the median, and this we have also
done below.

(2) Whatever the number of levels used in the up-and-down series
upon  which  the  response  proportions  are  based,  the  uppermost  stimulus  level 
has  associated  with  it  nothing  but  "longer"  responses,  while  the  lowest 
stimulus  level  has  no  "longer"  responses.  In  general  it  appears  that 
these  extreme  proportions  of  1.00  and  .00  should  be  set  aside,  not  only 
because  they  cannot  be  located  properly  on  the  normal  probability  plot, 
but  also  because  they  are  typically  of  low  reliability.  This  makes  the 
"line  of  best  fit"  a  line  which  is  fitted  to  two  fewer  proportions  than 
the  number  of  stimulus  levels  used. 

(3)  If  only  4  levels  have  been  used,  there  are  but  2  points  to 
plot.  In  the  event  that  these  do  not  bracket  the  median,  the  median  is  not 
clearly  determined.  Of  our  24  series,  4  involved  4  levels  and  in  one  of 
these  the  points  did  not  bracket  the  .50  point. 

(4)  If  only  3  levels  have  been  used  (i.e.  if  stimulus  step  size 
was  very  large),  the  plot  leads  to  no  estimate  of  the  psychometric  function. 

(5)  If  some  larger  number  of  levels  has  been  used,  say  6  or  7, 
the  chances  are  quite  good  that  some  of  the  proportions  near  the  extremes 
will  be  based  upon  very  few  observations  and  that  their  unreliability  will 
introduce  appreciable  non-linearity  of  the  plot  on  normal  probability 
paper,  even  though  the  psychometric  function  is  in  fact  normal.  In  our 
data,  3  of  the  7  sessions  where  6  or  7  levels  were  used  resulted  in  plots 
which  departed  markedly  from  linearity. 

(6) If 4 or 5 levels have been used, the individual proportions
are based on more trials and thus are more reliable than the proportions
obtained when 6 or 7 levels are used (given, of course, that N_tot is the
same in both cases). This should favor greater reliability of the
graphically determined medians in the 4-5 level case, and presumably better
agreement of these medians with the corresponding values of m̄ determined from
the up-and-down series by the Princeton formula. This is borne out in the
data summarized in Table 11.

Table 11 presents, for each test session, the value of m̄, the value
of the interpolated median (the median read by interpolating between the two
proportions which spanned the .50 point), values of mdn_VL and SD_VL (median
and standard deviation read from the visual line of best fit), and the value
of s̄.

From the table we may extract and compare the data for those 17
sessions in which 4 or 5 stimulus levels were used, and the data for the 7
sessions in which 6 or 7 levels were used. The former, quite obviously,
were sessions where the values of s̄ were lower (average value of s̄ = .83
steps), while the latter were those of higher variability (average value
of s̄ = 1.38 steps). There were relatively more non-linear plots which
presented line-fitting problems in the case of the 6 or 7 level sessions,
as expected from (5) above (3 out of the 7 sessions as compared with 2 out
of 17). The deviations of mdn_VL from m̄ averaged 0.05 stimulus steps for
4-5 level sessions and 0.08 for 6-7 level sessions. To remove the effect
of s̄ on these measures, we took as our index of the degree of disparity
between mdn_VL and m̄ the absolute value of (m̄ - mdn_VL)/s̄. This quantity
was 20% smaller, on the average, when 4 or 5 levels had been used than when
6 or 7 levels had been used. Presumably this difference is associated with
the greater reliability of the individual proportions in the 4-5 level
case, as expected under (6) above.

For the most part, values of the interpolated median were very similar
to mdn_VL. Values of the interpolated median deviated slightly more from m̄
than did the values of mdn_VL, but it is clear that agreement among the three
quantities was very good for the most part.

Values of SD_VL, it will be noted, were typically larger than
corresponding values of s̄. This is as it should be, considering the fact
that SD_VL includes variance between concurrent series, which is not included
in s̄. SD_VL and S (see page 50) did not differ systematically.

What we have found then is that with 90 test trials per session and a
step size where 4 or 5 stimulus levels are used, a "visual line of best fit"
to up-and-down data provides estimates of μ and σ which on the average are
very close to m̄ and s̄. The mdn_VL differed from m̄ on the average by only
0.05 step in the present data, and the value of SD_VL differed from s̄ on the
average by about 0.10 stimulus step. Clearly those differences reported
above between bias measures by the up-and-down method and by the method of
constant stimuli are not to be ascribed to peculiarities of the graphic
method which we had used for treating the constant stimulus data.


Table 11

RESULTS FOR GRAPHIC METHODS OF ESTIMATING PARAMETERS
OF THE PSYCHOMETRIC FUNCTION FROM UP-AND-DOWN DATA.

Column (1): average of the 3 estimates of μ by Princeton formula (m̄).
Column (2): interpolated median on probability paper.
Column (3): estimate of μ from visual line of good fit on probability
            paper (mdn_VL).
Column (4): average of the 3 estimates of σ by Princeton formula (s̄).
Column (5): estimate of σ from visual line of good fit on probability
            paper (SD_VL).

                    Estimates of μ           Estimates of σ
Subj.  Day      (1)      (2)      (3)        (4)      (5)

Group with 0.50 sec. standard
G       1      +1.58    +1.85    +1.80       2.14     1.55
        2      + .84    + .80    + .85        .70      .70
H       1      +1.01    +1.05    +1.05       1.00     1.15
        2      +1.00    + .95    +1.05       1.01     1.25
M       1      + .41    + .45    + .45       1.22     1.20
        2      + .19    + .20    + .20        .83      .90
V       1      + .49    + .40    + .40        .82     1.05
        2      - .32    - .30    - .35       1.08     1.20

Group with 1.58 secs. standard
P       1      + .71    + .65    + .80       1.99     2.25
        2      + .88    +1.05    + .95        .97     1.20
S       1      + .09    + .10    + .05        .94     1.00
        2      + .04    + .10    + .05        .69      .75
G       1      - .41    - .25    - .50       1.10     1.10
        2      - .22    - .25    - .30        .96     1.05
H       1      -1.02    -1.20    -1.10       1.10     1.05
        2      + .01    - .05    + .10        .57      .70

Group with 5.00 secs. standard
H       1      - .06      --       --         .57      --
        2      + .04    + .10    + .10        .89     1.00
B       1      + .09    + .05    + .05        .56      .70
        2      - .09    - .05    - .05        .48      .55
F       1      - .50    - .45    - .50       1.37     1.25
        2      - .75    - .70    - .80       1.13     1.45
C       1      -1.26    -1.40    -1.30       1.33     1.45
        2      - .45    - .40    - .40        .36      .45
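
The average agreement quoted for Table 11 can be rechecked with a few lines of arithmetic. The sketch below is illustrative only: the triples (m̄, mdn_VL, s̄) are transcribed from the scanned table and may carry scanning errors, the one session that shows no visual-line median is omitted, and the variable names are hypothetical. It computes the mean absolute deviation of mdn_VL from m̄ and the mean of the disparity index |m̄ - mdn_VL| / s̄.

```python
# (m-bar, mdn_VL, s-bar) for the 23 sessions of Table 11 that list a
# visual-line median; values transcribed from the scan (an assumption).
sessions = [
    (1.58, 1.80, 2.14), (0.84, 0.85, 0.70), (1.01, 1.05, 1.00), (1.00, 1.05, 1.01),
    (0.41, 0.45, 1.22), (0.19, 0.20, 0.83), (0.49, 0.40, 0.82), (-0.32, -0.35, 1.08),
    (0.71, 0.80, 1.99), (0.88, 0.95, 0.97), (0.09, 0.05, 0.94), (0.04, 0.05, 0.69),
    (-0.41, -0.50, 1.10), (-0.22, -0.30, 0.96), (-1.02, -1.10, 1.10), (0.01, 0.10, 0.57),
    (0.04, 0.10, 0.89), (0.09, 0.05, 0.56), (-0.09, -0.05, 0.48), (-0.50, -0.50, 1.37),
    (-0.75, -0.80, 1.13), (-1.26, -1.30, 1.33), (-0.45, -0.40, 0.36),
]

# Mean absolute deviation of the visual-line median from m-bar, in steps.
mean_abs_dev = sum(abs(m - v) for m, v, _ in sessions) / len(sessions)

# Mean disparity index: the deviation expressed in units of s-bar.
mean_index = sum(abs(m - v) / s for m, v, s in sessions) / len(sessions)

print(f"mean |m-bar - mdn_VL| = {mean_abs_dev:.3f} steps")
print(f"mean |m-bar - mdn_VL| / s-bar = {mean_index:.3f}")
```

On these transcribed values the mean absolute deviation falls within about a hundredth of a step of the 0.05 quoted in the text.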

d.  Experiment  CS-2:  A  comparison  of  the  method  of  constant  stimuli 
and  the  up-and-down  method  in  a  study  of  stereoscopic  discrimination.  This 
experiment  by  Kappauf  and  Arbit  was  similar  in  general  objective  to  the 
duration  study  but  followed  a  different  plan.  In  this  case  the  subject  was 
tested  on  a  preliminary  experimental  day  by  the  up-and-down  method  to  ob¬ 
tain  an  estimate  of  his  PSE  --  i.e.  the  position  of  a  variable  pin  which 
made  it  appear  equidistant  with  a  fixed  pin.  On  subsequent  days,  compari¬ 
son  stimulus  conditions  for  the  method  of  constant  stimuli  consisted  of  5 
pin  positions  which  were  symmetrically  distributed  about  this  preliminary 
PSE. Up-and-down measures were also taken on these days. The object was
to learn whether these later estimates of μ would vary less from the
preliminary one in the case of the method of constant stimuli than in the
case of the up-and-down method.

This was a non-project experiment, supported in part by the Research
Board of the University of Illinois, but is reported here because of its
relation to the duration experiment above.

(1)  Apparatus  and  general  procedure.  The  subject  observed  two 
pins  from  a  distance  of  5  meters  and  judged  which  of  the  pair  was  farther 
away.  The  pins  were  1.5  mm.  in  diameter  and  were  separated  laterally  by 
6  cm.  The  subject  viewed  the  central  vertical  segment  of  each  of  these  pins 
through  an  aperture  3  cm.  high  and  22  cm.  wide  in  a  large  white  occluding 
screen  behind  which  the  experimenter  stood.  The  pins  were  painted  black, 
and  through  the  aperture  were  seen  against  a  white  background  panel  which 
was  5.5  meters  from  the  subject.  Luminance  levels  of  the  occluding  and 
the  background  screens  were  5  and  7  foot  lamberts  respectively.  The  pins 
were  exposed  to  view  when  the  experimenter  raised  a  sliding  panel  behind 
the  opening. 

The  right  hand  pin  was  the  movable  or  variable  one.  The  subject 
was  required  to  report  the  position  of  this  pin  as  "nearer"  or  "farther" 
than  the  left  hand  one.  He  made  this  report  when  the  sliding  panel  was 
lowered, ending each 3-second exposure. He was cautioned against making
head movements while viewing the pins. A new exposure or trial began every
10 seconds.

(2)  Preliminary  measures.  On  the  preliminary  day,  the  subject 
made  300  observations,  with  a  rest  period  allowed  after  every  50  (i.e.  about 
every 8 minutes). Observations were taken using 3 concurrent up-and-down
series and a stimulus step size of 1 cm. Values of m̄ and s̄ were computed
for each subject.

(3) Test sessions. Four test days followed the preliminary day.
On each of these, each subject made 120 observations by the method of
constant stimuli and 120 observations programmed by the up-and-down method
using three concurrent series. Method order was counterbalanced on
successive days for the same subject and for any given day was balanced across
subjects. Observations during these test days were conducted under
conditions which depended upon m̄ and s̄ for the preliminary day. Those subjects
for whom s̄ was between 0.7 and 2.0 stimulus steps were tested for the
remainder of the experiment using a step size of 1 cm. Those whose standard
deviations were between 2.0 and 5.0 steps were tested on remaining days
using a 2 cm. stimulus step. Two subjects were sufficiently precise in
their preliminary judgments that a stimulus step of 0.5 cm. was chosen for
them.
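
The rule for choosing the test step size can be written out as a small function. This is only a sketch: the function name is hypothetical, the treatment of values falling exactly on 0.7 or 2.0 is not stated in the text and is chosen arbitrarily here, and the reading that "sufficiently precise" means a preliminary s̄ below 0.7 step is an assumption.

```python
def choose_step_size_cm(s_bar_prelim):
    """Map the preliminary-day s-bar (in 1-cm stimulus steps) to the
    stimulus step size, in cm, used on the four test days."""
    if s_bar_prelim < 0.7:       # assumed meaning of "sufficiently precise"
        return 0.5
    if s_bar_prelim <= 2.0:      # boundary handling is an assumption
        return 1.0
    if s_bar_prelim <= 5.0:
        return 2.0
    raise ValueError("no step size is specified for s-bar above 5.0 steps")
```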

Each  subject  was  tested  with  a  set  of  comparison  positions  of  the 
right  hand  pin  which  included  one  position  at  m  from  the  preliminary  run. 
Other  comparison  positions  ranged  forward  and  back  from  m  by  steps  of  the 
chosen  size.  For  testing  by  the  method  of  constant  stimuli,  the  five  chosen 
positions  were  symmetrically  distributed  about  the  preliminary  m.  For  up- 
and-down  testing,  many  comparison  positions  might  be  used. 

(4)  Subjects.  The  subjects  included  the  two  experimenters  and  6 
others  who  were  students. 

(5) Analysis of the data. The analysis proceeded in the same
manner as for the duration experiment. For the method of constant stimuli,
the proportion of "farther" judgments obtained at each comparison pin
position on each experimental day was plotted on normal probability paper and
a line of best fit adjusted to the plotted points by eye. This was done
independently by two judges, and the mean of their estimates of μ and σ was
determined. This provided a value of mdn_VL and of SD_VL for each subject
on each of four experimental days. Up-and-down data for each session were
processed to provide a daily value of m̄ and s̄. The up-and-down judgments
were also assembled into a composite daily distribution and plotted as
response proportions to which a line was fitted on normal probability paper.

(6) Results. Our results are summarized in Table 12 and in
Figure 7.

For our basic comparison of the methods of data collection, we
again compared mdn_VL and SD_VL for the constant stimulus data with m̄ and s̄
for the up-and-down data. For every one of the 8 subjects, the difference
between m̄_pre (the preliminary-day m̄) and the average value of m̄ for the
four day test period was greater than the difference between m̄_pre and the
average value of mdn_VL for the four day period. Some of the differences for
individual subjects were small, but the scatterplot of the results in
Figure 7 is very much like that for the duration study. Thus we must conclude
again that the observed PSE on any given day was constrained to be near the
middle of the comparison series when the method of constant stimuli was used.

Variability of the daily values of m̄, as measured by the range
of these values, was greater than the range of the daily values of mdn_VL
for 6 of the 8 subjects. Again, as in the duration study, we fail to find
measures by the up-and-down method more consistent as expected from
"reliability" considerations. Rather, we find relatively less variability of
the estimates of μ by the method of constant stimuli, implying constraint
on the daily values of mdn_VL.

With regard to estimates of σ, the method of constant stimuli gave
larger estimates for 5 subjects, the up-and-down method for 3. These results,
with the duration data in Table 10 (p. 58), imply that the methods produce
equivalent measures of differential sensitivity as measured by s or SD.

Graphic examination of response proportions computed from
up-and-down records provided results in complete accord with those in the
duration study. The mdn_VL for up-and-down sessions agreed with m̄ within
0.05 stimulus steps on the average when the number of levels used was 4 or
5. The disparity between these two measures advanced to 0.10 stimulus steps
on the average when the number of levels used was 6 or 7.

e. Summary comparison of the methods. When the up-and-down method and
the method of constant stimuli are each used for continued testing with a
given subject, the method of constant stimuli provides PSE's which are
more stable and which depart less from the POE than do those provided by the
up-and-down method. In an operational sense the up-and-down method is the
less reliable, although theoretically it is the more reliable. The foregoing
experiments are interpreted to mean that results for the method of constant
stimuli are of spuriously high consistency because of the operation of
stimulus context factors associated with the fixed set of comparison stimuli.


Table 12

DATA FOR STEREOSCOPIC JUDGMENT,
OBTAINED BY THE UP-AND-DOWN METHOD AND BY THE METHOD OF CONSTANT STIMULI

Note: As in Table 10, values of m̄ and mdn_VL are given in stimulus steps. Here
they are given as deviations from the preliminary m̄ described in the text.

Column (1): average value of m̄ over 4 days.
Column (2): average value of mdn_VL over 4 days.
Column (3): range of values of m̄ over 4 days.
Column (4): range of values of mdn_VL over 4 days.
Column (5): range of values of m̄ over 2 days when up-down trials were first.
Column (6): range of values of mdn_VL over 2 days when const. stim. trials
            were first.

          Size of
          stimulus
Subject   step         (1)      (2)      (3)     (4)     (5)     (6)

A         2 cm.       +1.83    +0.91    1.11     .47    1.11     .30
G         2 cm.       -1.26    -1.18     .73     .88     .16     .87
R         1 cm.       -0.10    -0.05     .45     .40     .15     .32
K         0.5 cm.     +0.39    +0.21     .76     .54     .14     .47
F         1 cm.       +0.08    +0.02     .49     .63     .38     .35
H         1 cm.       +1.44    +0.22     .99     .15     .99     .15
J         1 cm.       -1.09    -0.60    2.19    1.03    2.19     .52
K         0.5 cm.     +0.38    +0.22    1.35     .20    1.00     .10


Figure 7: Relation between estimates of μ by the method of constant
stimuli and by the up-and-down method in the study of stereoscopic
judgments. The plotted points are for the individual subjects. Both axes
are scaled in stimulus steps. (Horizontal axis: average value of
mdn_VL - m̄_pre, data for method of constant stimuli.)

It was noted above that Fernberger (1913) had devised the scheme of
embedding trials by the method of limits in a program of trials by the
method of constant stimuli in order to conceal the sequential character of
the method of limits series. Suppose now we make a full turn about on
this scheme and embed trials by the method of constant stimuli in a
program of trials by the up-and-down method (perhaps two concurrent series).
Our intent would be to "conceal" the stimulus context provided by the
comparison stimuli on constant stimulus trials. We would accomplish this by
merging this context with the stimulus context of the up-and-down trials.
In other words, both up-and-down and constant stimulus methods would share
the same stimulus context. We expect that the estimates of μ should then
vary for the two methods in the direction indicated by theory. This
experiment has not yet been run, but deserves early attention.

17.  A  COMPARISON  OF  THE  UP-AND-DOWN  METHOD  AND  THE  METHOD  OF  LIMITS. 

The  method  of  limits,  in  its  traditional  form,  employs  a  sequential 
scheme  of  stimulus  presentation  and  in  this  respect  is  a  member  of  the 
same  family  of  measurement  methods  as  the  up-and-down  method.  In  the 
method  of  limits,  conditions  are  changed  in  an  orderly  way  from  trial  to 
trial  on  the  basis  of  the  subject's  responses.  Every  response  X  deter¬ 
mines  that  conditions  for  the  next  trial  will  move  one  step  along  the 
stimulus  scale  in  the  direction  of  being  more  favorable  to  the  occurrence 
of  response  Y.  The  occurrence  of  a  Y  response  (or  some  prescribed  number 
of  Y  responses)  identifies  the  "limit"  for  that  series  of  trials,  where¬ 
upon  the  experimenter  may  begin  a  new  series  by  jumping  back  to  a  stimulus 
level  where  X  is  again  almost  certain  to  occur.  The  new  series  of  trials 
proceeds  as  did  the  first.  In  this  form,  the  method  of  limits  could  be 
described  as  "a  modification  of  the  up-and-down  method"  with  stimulus 
steps  in  one  direction  many  times  larger  than  the  steps  taken  in  the  other 
direction.  Oppositely,  the  up-and-down  method  is  likened  by  many  to  a 
"continuous"  or  "progressive"  form  of  the  method  of  limits,  and  we  find 
Guilford (1934) considering the up-and-down method as one of the variations
of the method of limits. More specifically, we may say that the similarity
of the two methods in so far as their data collection operations are
concerned resides in the fact that their programs of trials both follow
Markov designs (see Smith, 1961).

It  should  be  apparent,  then,  that  the  method  of  limits  and  the 
up-and-down  method  must  be  closely  related  in  certain  of  their  quantita¬ 
tive  features.  Interestingly,  although  the  method  of  limits  has  seen 
considerable  use  by  psychologists,  there  has  been  relatively  little  dis¬ 
cussion  of  its  quantitative  properties.  We  take  this  opportunity  therefore 
to  consider  some  of  these  properties  in  a  direct  comparison  of  the  method 
of  limits  with  the  up-and-down  method. 

a. Computation of the expected distribution of trials and the expected
distribution of limits. The computation of these expected distributions
proceeds in similar fashion for both methods, based simply upon the
response probabilities at each stimulus level and upon the assumption of
trial-to-trial independence.

Urban (1907, 1908) appears to have been the first to reason from response
probabilities and indicate the manner in which one may derive the expected
distribution of limits under the method of limits. Suppose that we have
a series of response probabilities such that at stimulus level i, q_i is
the probability of response X and p_i = 1 - q_i is the probability of
response Y.

For convenience here we assume that q_1 is so close to 1.00 that we may
take its value to be 1.00. Then the following computations apply:


Stimulus   Prob. of     Prob. of     Prob. of last X    Prob. of first Y
Level      response X   response Y   occurring at       occurring at
                                     this level         this level

   1       q1 = 1.00    p1 = .00     q1 p2              p1 = .00
   2       q2           p2           q1 q2 p3           q1 p2
   3       q3           p3           q1 q2 q3 p4        q1 q2 p3
  etc.

In the last column we have the probability distribution of limits, i.e. the
expected distribution of limits, under the condition that we define the
limit for each trial series as that stimulus level at which the first Y
response occurs. In the next to last column we have the expected
distribution of limits if we define the limit for each series of trials as
that stimulus level at which the last X response occurs. Clearly these two
distributions are identical except for the fact that one is offset from the
other by one stimulus level.

Working back from the first of these distributions of limits we may
quickly establish the distribution of trials which would be accumulated if
many series of trials were conducted.
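The computations tabulated above can be sketched in a few lines of Python.
This is only an illustrative sketch under the stated assumption of
trial-to-trial independence; the function name `limit_distributions` and the
example probabilities are hypothetical and do not come from the report. Given
the values q_i for the successive levels of one series, it returns the
distribution of the first-Y limit, the distribution of the last-X limit, and
the probability that each level is tested at all.

```python
def limit_distributions(q):
    """Expected distributions for one series under the method of limits.

    q[i] is the probability of response X at the i-th level tested
    (hypothetical input; q[0] is assumed to be 1.00 and q[-1] to be 0.00).
    Trials are assumed independent, as in the text.
    """
    reach = []      # probability that level i is tested at all
    first_y = []    # probability that the first Y occurs at level i
    survive = 1.0   # probability of an unbroken run of X responses so far
    for qi in q:
        reach.append(survive)
        first_y.append(survive * (1.0 - qi))
        survive *= qi
    # The last X falls one level before the first Y, so the same
    # probabilities apply, offset by one stimulus level.
    last_x = first_y[1:] + [survive]
    return first_y, last_x, reach


# Illustrative values only: four levels with q = 1.00, .80, .50, .00.
first_y, last_x, reach = limit_distributions([1.0, 0.8, 0.5, 0.0])
```

Dividing each entry of `reach` by the sum of `reach` gives the expected
proportion of trials falling at each level when many such series are run.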

b. Dependence of the expected distributions of trials and of limits
upon stimulus step size. Just as the expected distribution of trials
depends on step size in the up-and-down method, so the expected
distributions of trials and of limits depend on step size in the method of
limits.

The first to recognize the influence of step size on the limits
obtained in individual series of observations, and hence on the average
limit obtained for a number of series, appears again to have been Urban
(1907, 1908). Urban noted that it was common practice for experimenters
to prefer small step sizes because these presumably would assure greater
precision in measuring the limit. His probability analysis, however, led
him to observe that small step sizes shift the expected value of the limit
in the direction of "anticipation." His summary comment was: "... the
result of a determination of the threshold by the method of just
perceptible differences depends somewhat on the size of the intervals
which are used. It is therefore not necessarily a sign of incomplete
training of the subject or of his inability to direct his attention to the
comparison of the stimuli, if series with small differences (i.e. stimulus
steps) fail to give the same results as series with large differences."
(1908, p. 60). Later, Fernberger (1913), working under Urban's direction,
performed a weight lifting experiment in which he systematically varied
step size over a fivefold range, all step values being below 1σ.
Fernberger did not comment on the matter, but it is clear from his data
(his Tables XLII and XLIII) that "errors of anticipation" increased as the
stimulus steps grew smaller, in the manner expected from Urban's analysis.

The complete picture, here, can be developed if we assume some specific
form for the psychometric function, and carry out Urban's calculations
from the table above for a variety of step sizes. When we perform
such an analysis, using step sizes ranging from 0.1σ to 3.0σ and assuming
the psychometric function to be the cumulative normal, the results which
we obtain are those summarized in Figure 8.

Each of the values plotted in the figure is the mean of the expected
distribution of limits for some particular condition, i.e., the "expected"
limit for that condition. We have considered limits defined in three
different ways: that the limit be (1) the stimulus level of the last X
response, (2) the stimulus level of the first Y response, or (3) the
average of these two (i.e., the stimulus level midway between that of the
last X and the first Y response). Note that the figure deals with limits
obtained for "decreasing" or "descending" stimulus series only.
Symmetrical functions would apply for "ascending" series.

Figure 8. The effect of size of stimulus step on the expected
value of the limit for descending series. Values of the limit are expressed
relative to μ, the mean-median of the normal psychometric function. Limits
have been computed under three definitions: (a) as the stimulus level
where the last X response occurs; (b) as the stimulus level where the
first Y response occurs; and (c) as the average of the stimulus levels
of the last X response and first Y response.

As we see from the figure, if the limit is defined as the stimulus
level where the last X response occurs, the expected limit consistently
occurs before the mean-median of the psychometric function is reached, i.e.
it deviates in the direction described in the literature as "anticipation."

If the limit is defined oppositely as the stimulus level where the first
Y response occurs, the measure is biased in the direction of anticipation
for small step sizes but shifts to become biased in the direction of
habituation for step sizes larger than about 0.7σ. Lastly, if the limit is
defined as the stimulus level midway between that where the last X and
that where the first Y occurs, the measure is biased consistently in the
direction of anticipation, but this bias becomes smaller and smaller as
step size becomes larger.

We  see  then  that,  under  our  third  and  most  commonly  used  definition 
of  the  limit,  an  error  of  anticipation  is  a  statistical  property  of  the 
method  of  limits.  This  renders  meaningless  many  of  the  discussions  of 
"errors  of  anticipation"  which  have  appeared  in  the  literature  ascribing 
such  errors  to  the  subject.  And  this  was  Urban's  point,  of  course. 

An  interesting  property  of  the  functions  plotted  in  Figure  8  is  that 
they  are  essentially  independent  of  the  position  of  the  mean-median  of  the 
psychometric  function  relative  to  the  testing  levels:  expected  limits  when 
the  mean-median  is  midway  between  testing  levels  and  when  it  coincides  with 
some  testing  level,  agree  within  the  limits  of  graphing  accuracy. 
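The dependence of the expected limit on step size can be explored
numerically with a short Python sketch. It is an illustration only, not the
computation actually used for Figure 8: the function names, the choice
μ = 0 and σ = 1, the `offset` parameter (position of μ relative to the testing
levels) and the starting distance `span` are all assumptions introduced here.

```python
import math


def phi(z):
    """Cumulative normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_limits(step, offset=0.0, span=8.0):
    """Expected limits for a descending series, in sigma units from mu.

    Assumptions (not from the report): mu = 0, sigma = 1, response X has
    probability phi(x) at level x, testing levels lie at offset + k*step,
    and each series runs from about `span` sigma above mu to about `span`
    sigma below it. Positive values lie on the side from which the series
    approaches, i.e. in the direction of "anticipation".
    """
    k_top = math.ceil(span / step)
    levels = [offset + k * step for k in range(k_top, -k_top - 1, -1)]
    survive = 1.0        # probability that no Y response has occurred yet
    mean_first_y = 0.0
    for x in levels:
        q = phi(x)
        mean_first_y += survive * (1.0 - q) * x
        survive *= q
    mean_last_x = mean_first_y + step      # last X is one step earlier
    midpoint = mean_first_y + step / 2.0   # average of the two definitions
    return mean_last_x, mean_first_y, midpoint


for step in (0.1, 0.5, 1.0, 2.0, 3.0):
    last_x, first_y, mid = expected_limits(step)
    print(f"step {step:3.1f}: last X {last_x:+.3f}  "
          f"first Y {first_y:+.3f}  midpoint {mid:+.3f}")
```

Calling `expected_limits(step, offset=step / 2)` places μ midway between
testing levels, which allows the independence of position noted above to be
checked for any step size of interest.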

Recent discussions of this bias in the "one-way" method of limits appear
in Anderson, et al. (1946), McCarthy (1949) and Brown and Cane (1959). A
scheme for adjusting one-way measures in terms of estimated step size is
given by Anderson et al., but two-way measures are clearly preferred.

c. The form of the expected distribution of trials. The up-and-down
method is clearly designed to concentrate trials in the vicinity of the
mean-median of the psychometric function, and how well it does this we have
already seen (page 15). For a one-way method of limits, i.e. where
approaches to the mean-median are made from one direction only, the method
obviously concentrates trials at that one end of the stimulus scale. But
if both ascending and descending series are run, as is usually the case,
then the method of limits provides some concentration of trials near the
mean-median when the step size is not too small. Thus when step size is
between 1σ and 2σ, the range of interest in earlier discussions, more
trials (by a factor which is admittedly not large) occur at levels near the
mean-median than at each level at the extremes of the stimulus scale. In
general, however, the distribution of trials by the method of limits
resembles the rectangular distribution of the method of constant stimuli
much more than it does the peaked distribution of trials obtained with
the up-and-down method.

d. Bias as a function of initial testing level. It was noted that
initial testing level introduces bias in the estimate of μ by the
up-and-down method if that level is far from μ and if step size is small.
In the case of the method of limits, a similar source of bias exists, a bias
to be added algebraically to that shown in Figure 8. Computations for
Figure 8 assumed that each stimulus series would begin far enough from μ
that the probability of response X was exceedingly close to 1.00. If the
initial testing level is at a level where the probability of response X is
not close to 1.00, the expected limit will deviate less in the direction of
anticipation than shown in the figure (see Anderson, et al., 1946, p. 106).

It is interesting to note in this connection that in Fernberger's
methodological comparison of the method of limits and the method of
constant stimuli (1913) he computed expected limits for ascending and
descending series and compared their average with the mean-median of the
psychometric function. These values should have agreed in Fernberger's
case, that of a cumulative normal psychometric function. But he found
discrepancies (see his Tables XLVIII and XLIX). What happened was that he
overlooked the fact that his range of stimulus levels did not push the
response probabilities sufficiently close to 1.00 and .00. His computed
expected limits in both ascending and descending directions were therefore
biased and their average was in error. Fernberger's evaluating comments on
the method of limits were therefore unjustified.

e. Preferred step size. Just as step size should be reasonably large
for the up-and-down method, it appears that it should be similarly large
when the method of limits is used to estimate μ. Large step sizes
involve less bias of the one-way limit defined as the stimulus level midway
between that of the last X response and that of the first Y response,
involve less bias associated with initial testing level (a relation we may
infer from Fernberger's data), and clearly require fewer trials per limit.

f. Estimate of μ when the psychometric function is not symmetrical.
The average of an equal number of ascending limits and descending limits
provides an unbiased estimate of the mean-median of the psychometric
function when the latter is symmetrical. When there is asymmetry in this
function, however, those series which start from the end where the function
has the longer tail will result in an average limit which will depart more
from the median of the psychometric function than those series which
approach the median from the other direction. Hence the average of
ascending and descending limits will deviate from the median of the
psychometric function in the direction of the longer tail, just as m does
in the case of the up-and-down method.

g. Continued observation by the same subject and the problem of his
knowledge of the stimulus sequence. From the time of its earliest use,
the method of limits posed a problem: what to do about the subject's
insight into or knowledge of the stimulus sequence from trial to trial.

We have already discussed this problem with regard to the up-and-down
method. As far as the method of limits is concerned, three ways of
handling the problem have been advanced and adopted by different
experimenters.

(1) Give the subject full knowledge of the sequence. Wundt proposed this,
arguing that only with knowledge of the sequence would the subject's
attitude for observation be most favorable at the critical moment when the
limit was reached. Guilford (1954) and Woodworth and Schlosberg (1954)
support the use of the method in this form in spite of the recognition that
successive judgments in the series will be interdependent. The view here
is, in effect, that the object of the method of limits is to collect data
under the condition of such interdependence.

(2) Conceal or disguise the sequence of trials by the use of dummy trials
between trials of the sequence, or by the use of several concurrent,
interwoven series. These procedures are like those advanced for the
up-and-down method on page 31 above. Fernberger (1913) did this in part,
by interspersing trials by the method of limits with others by the method
of constant stimuli. Others seem not to have followed his lead. On the
one hand, our usual interest in both ascending and descending series makes
the plan of using concurrent, interspersed series seem reasonable. On the
other hand, the use of extreme testing levels to initiate series may
introduce previous-trial stimulus context effects which could disturb our
obtained limits. The strength of the latter argument would appear to make
the concurrent series strategy much less suited to the method of limits
(where extreme stimulus conditions are used) than to the up-and-down
method (where most trials are conducted under non-extreme stimulus
conditions).

(3) Remove the sequence altogether by programming trials in random order.
This routine was introduced by Kraepelin in 1891, and subsequently endorsed
by Muller (1904), by Urban (1908) and by Titchener (1903) for at least some
discrimination tasks. When data are collected in this way, the
experimental session proceeds in the same manner as by the method of
constant stimuli, but scoring is by the method of limits. The basic
argument in favor of random ordering of the stimuli is that it prevents the
development of any special "set" associated with the series, or as Smith
recently puts it (1961), avoids "the effect of the subject's knowledge of
the stimulating conditions on other than the relevant sensory basis."
Random sequences must still be extended to extreme stimulus levels,
however, if scoring by the method of limits is to be employed. Otherwise
bias from restricted range will occur.

All  in  all,  then,  knowledge  of  the  sequence  is  more  difficult  to  deal 
with  effectively  in  the  method  of  limits  than  in  the  up-and-down  method. 

h. Statistical properties of the estimates of μ. In the case of m
for the up-and-down method, we have seen that it is an approximate maximum
likelihood estimate of μ. This property does not apply to the limit or
the mean limit, but there is one interesting property which Urban (1908)
observed for the "one-way" limit. Consider the various stimulus levels
and the associated values of p_i, the probability of a Y response. When
the expected distribution of limits is computed under the definition that
the limit is the stimulus level where the first Y response occurs, the
mode of that distribution cannot occur later than at the first stimulus
level where p is greater than .50. Since it may occur before this level,
however, and by an amount which depends on step size, this is not a very
strong property of the limit.
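The property Urban observed is easy to check numerically. The Python sketch
below is illustrative only: the helper name `first_y_mode` and the two sets
of probabilities are hypothetical, and the p values are assumed to increase
steadily along the series.

```python
def first_y_mode(p):
    """Return (index of the mode, distribution) of the first-Y limit.

    p[i] is the probability of a Y response at the i-th level tested
    (hypothetical values, assumed non-decreasing along the series).
    """
    survive = 1.0    # probability that every earlier response was X
    probs = []
    for pi in p:
        probs.append(survive * pi)
        survive *= 1.0 - pi
    return probs.index(max(probs)), probs


# Coarse steps: the mode falls at the first level where p exceeds .50.
mode_coarse, _ = first_y_mode([0.0, 0.1, 0.3, 0.6, 0.9, 1.0])

# Finer steps near the start: the mode falls well before that level.
mode_fine, _ = first_y_mode([0.0, 0.3, 0.4, 0.45, 0.6, 1.0])
```

In the first case the mode is at the level with p = .6; in the second it is
at the level with p = .3, three levels before p first exceeds .50.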

Summary comments. The method of limits and the up-and-down method
are similar in the sequential character of their testing programs and thus
prove to have a number of comparable or parallel properties. They differ,
however, in their procedures for estimating μ, and when this is the primary
purpose of a study, our choice between using the "two-way" method of
limits and the up-and-down method is clearly in favor of the latter on
grounds of testing efficiency (see Anderson, et al., 1946).

18.  FURTHER  USES  AND  LIMITATIONS  OF  THE  UP-AND-DOWN  METHOD. 

Thus far we have limited our discussion of the up-and-down method to
its application in two-response, differential judgment situations. What
we should like to do here is to comment briefly on the scope of the
method.

a. Applications in stimulus threshold measurement and in scaling.
The up-and-down method may reasonably be considered for application in
any measurement situation where the probability of response function ranges
from probability .00 (or close thereto) to probability 1.00 (or close
thereto). The method will therefore serve well in single stimulus
situations where the task is to determine stimulus or detection thresholds
(the Bekesy and Oldfield problem). The method may also be applied in
scaling work where the task is to locate limens between adjacent response
categories. This particular case has been of interest to McDiarmid and is
represented in some of the work to be described in Part II of this report.

Suppose we offer our subject the use of a number of ordered response
categories, providing him with, say, four numbered response keys and
requiring that he rate the loudness of each stimulus which he hears on a
scale from 1 to 4. For this situation, the testing program is based on
four-minus-one or three concurrent up-and-down series. For one of these
series, each new trial moves up one stimulus step (to a more intense tone)
following a response of "1", but moves down one step following a response
of "2", "3", or "4". This series hunts for the limen between response
categories "1" and "2". Similarly the second series moves up one stimulus
level after a response of either "1" or "2", but down after "3" or "4".
The third series moves down only after a response of "4" and moves up
otherwise. The order of taking trials from the three series is random.
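The stepping rule for the three concurrent series can be stated compactly
in code. The Python sketch below is illustrative only: the function names,
the integer coding of stimulus levels, and the `rate` callable that stands in
for the subject are assumptions introduced here, not part of the procedure
as reported.

```python
import random


def next_level(series, response, level):
    """Return the next stimulus level for one of the concurrent series.

    series   -- 1, 2 or 3: the series hunting for the limen between
                response categories `series` and `series + 1`
    response -- the rating (1 to 4) just given on this series
    level    -- current level, counted in steps (higher = more intense)
    """
    return level + 1 if response <= series else level - 1


def run_session(rate, start_levels, n_trials, rng=random):
    """Interleave the series in random order and record each trial.

    `rate` is a hypothetical callable mapping a stimulus level to a
    rating from 1 to 4; it stands in for the subject.
    """
    levels = dict(start_levels)            # e.g. {1: 10, 2: 10, 3: 10}
    history = {k: [] for k in levels}
    for _ in range(n_trials):
        series = rng.choice(sorted(levels))
        response = rate(levels[series])
        history[series].append((levels[series], response))
        levels[series] = next_level(series, response, levels[series])
    return history
```

Each entry of the returned history is a (level, response) pair, so the
trial distribution for each series can be tallied separately afterwards.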

In our work, we have used this scaling procedure to evaluate the
variability of a subject's differential judgments. He listened to pairs
of stimuli and in judging the difference between them was allowed four
response categories, which may be paraphrased as follows: (1) "certain
the second tone was louder," (2) "thought the second one louder," (3)
"thought the second one softer," (4) "certain the second was softer." As
will be seen later, this technique provided very informative data relative
to our stimulus context problem.

b. The location of percentile points other than the median of the
psychometric function. In principle, the up-and-down method can be used
to estimate any percentile point on a probability of response curve.
Normality and a .00 to 1.00 probability range are assumed (Anon., 1944).
It is recognized, however, that other sequential methods should be superior
to the up-and-down method for the determination of high or low percentile
points, methods which will concentrate trials in the vicinity of the
desired points rather than near the median. Such methods include variants
on the up-and-down method and other "staircase" methods, including the
one-way method of limits (see Anderson, et al., 1946; Smith, 1961).

c. Problems where the probability-of-response curve does not fall to
zero. The typical case where the psychologist is interested in estimating
the 75th percentile point is not the case just discussed, but the case
where the psychometric function ranges from probability 1.00 to .50 and
levels off at the latter value. Such a situation arises, for example,
when we would determine a stimulus threshold in a two-choice situation
where random behavior results in 50% correct responses. Here, it is clear,
the up-and-down method as such fails because there is no guarantee that
the trial series will turn around at the lower end of the stimulus scale.

It might be thought, however, that modifications of the method would
assure some success in keeping observations at the higher stimulus levels.
Two modifications suggest themselves. One of these is to move up by large
steps, say 2 to 4 times as large as the steps used in moving down. The
other is to generate an up-and-down sequence on the basis of blocks of
trials, where all trials in any block are run at the same stimulus level;
e.g., move down one step if all trials in a 3-trial block are correct,
but move up one step as soon as the first failure in a 3-trial block is
observed. Examination of the expected trial distributions under these two
modifications indicates that the second one should be the more effective
of the two, provided that blocks of at least 3 trials are used. Analysis
of the data would entail establishing a line of best fit to the response
proportions at all stimulus levels, to provide the 75% point. This
procedure, using only the programming feature of the up-and-down method,
should be superior to the method of constant stimuli because it is sure to
concentrate trials appropriately. (We note in passing that m computed for
this "blocks up-and-down" sequence will not estimate a particular
percentile point of the psychometric function, but may prove a useful
statistic for comparing experimental conditions, as it did for Heinemann,
1961).
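Why blocks of trials help the series turn around at the lower end of the
scale can be seen from the move probabilities alone. The Python sketch below
is an illustration under the assumption that trials within a block are
independent with a common probability of a correct response; the function
name `block_move_probabilities` is hypothetical.

```python
def block_move_probabilities(p_correct, block_size=3):
    """Probabilities of moving down and of moving up after one block.

    Rule sketched in the text: move down one step only if every trial
    in the block is correct; move up one step at the first failure.
    Trials within a block are assumed independent (an assumption).
    """
    p_down = p_correct ** block_size
    p_up = 1.0 - p_down
    return p_down, p_up


# At the chance floor of .50 correct, a single-trial rule moves up and
# down equally often, while a 3-trial block moves up far more often.
single = block_move_probabilities(0.5, block_size=1)   # (0.5, 0.5)
block3 = block_move_probabilities(0.5, block_size=3)   # (0.125, 0.875)
```

With p = .50 the 3-trial block rule moves up with probability .875, so the
sequence is pushed back toward the higher stimulus levels once responding
falls to chance.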

d. The context problem. Because the up-and-down method does not require
many trials per measurement, it is particularly well suited to work on the
effect of previous trials on a current judgment. The strategy is to use
several concurrent series with the same subject, and to set up these series
so that they differ as to standard, preceding trial condition, etc.
Specific features of these testing procedures will be described in Part II.


19.  SOME  COMMENTS  ON  THE  VARIABILITY  OF  s. 

a. The estimated standard error of s. On the basis of the fact that
s is an approximate maximum likelihood estimate of σ, the Princeton Research
Group (Anon., 1944; Dixon and Mood, 1948) derived a suitable estimate of the
standard error of s:

    s_{s(\text{up-down})} = \frac{H\,s}{\sqrt{N}}

where H is a quantity which varies with step size and with the position of
μ relative to the testing levels. H is almost everywhere larger than 1,
and over the range of step sizes of greatest interest to us (i.e. between
1σ and 2σ), H has an average value of about 1.3 and never exceeds 1.4. If
we are satisfied to know H to within 10%, we may take it to be 1.3
regardless of the position of μ relative to the testing levels. If we would
be conservative, we may take the value 1.4.

It is of interest to recall that our estimate for the standard error
of our traditional estimate of σ (i.e. \sqrt{\sum x^2/(N-1)}) is:

    s_{s(\text{traditional})} = \frac{s}{\sqrt{2N}} = \frac{s}{1.4\sqrt{N}}

Comparing this with the value for s_{s(up-down)} when H = 1.3 or 1.4, we
see that for a given value of s and a given value of N, the standard error
of s(up-down) is just about twice as large as the standard error of
s(traditional). Thus the variance of s(up-down) is approximately four
times as large as the variance of s(traditional).
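The factor of about two can be verified with a few lines of Python. The
sketch assumes that the up-and-down standard error has the form H·s/√N and
that the traditional one is s/√(2N); the function names and the numerical
inputs are hypothetical and serve only to illustrate the comparison.

```python
import math


def se_s_updown(s, n, h=1.3):
    """Standard error of s from an up-and-down series, taken as H*s/sqrt(N).

    The form H*s/sqrt(N) and the default H = 1.3 are assumptions of this
    sketch, chosen to match the values of H discussed in the text.
    """
    return h * s / math.sqrt(n)


def se_s_traditional(s, n):
    """Large-sample standard error of the traditional estimate, s/sqrt(2N)."""
    return s / math.sqrt(2.0 * n)


# Illustrative inputs: s = 2.0 stimulus units, N = 25 trials.
ratio_13 = se_s_updown(2.0, 25, h=1.3) / se_s_traditional(2.0, 25)
ratio_14 = se_s_updown(2.0, 25, h=1.4) / se_s_traditional(2.0, 25)
```

Under these assumptions the ratio is H·√2 whatever the values of s and N,
which lies between 1.8 and 2.0 for H between 1.3 and 1.4.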

b. Implications for a test of homogeneity of values of s. From the
above information we recognize that s^2(up-down) cannot be distributed as
χ^2 in the manner of s^2(traditional). Hence tests of homogeneity of
variance which we have for s(traditional) are not directly applicable for
s(up-down). A test like Hartley's F_max test, however, would be a useful
one to have, and this has led us to explore the possibility of adapting a
Hartley-type test for evaluating the homogeneity of values of s from
different up-and-down series.


The Hartley test (1950) is a "range test" on variance estimates. It
is based on the fact that log s^2 (and hence also log s) is approximately
normally distributed with variance equal to 2/(df-1). The distribution of
the range of samples from a normal distribution is known for samples of
any size, k, and so 95th percentile values of this range are known. What
Hartley did was to define F_max as the largest of k values of s^2 divided
by the smallest of the k values, and then derive the 95th percentile value
of F_max from the 95th percentile value of the range as follows:

    \log (s^2_{max}/s^2_{min})_{.95} = (\text{range})_{.95} \sqrt{2/(df-1)}

where df represents the degrees of freedom in each estimate of s^2,
where (range)_{.95} is the 95th percentile value of the range just
mentioned, and where (s^2_max/s^2_min)_{.95} = F_max,.95.

Now if we are willing (1) to presume that log s(up-down) is about twice
as variable as log s(traditional), preserving the relative variabilities
already observed for s(up-down) and s(traditional), and (2) to proceed as
if log s(up-down) were normally distributed, then log s(up-down) is about
as variable as log s^2(traditional), and we have

    (s_{max}/s_{min})_{.95\,(\text{up-down})} = (s^2_{max}/s^2_{min})_{.95\,(\text{traditional})} = F_{max,.95}

This says that to obtain an approximate test of the homogeneity of several
values of s(up-down), we should find the ratio of the largest to the
smallest values of s(up-down), not the ratio of values of s^2(up-down), and
use Hartley's tabled values of F_max. We find that this test involves a
statistic which appears to be \sqrt{F_max} rather than F_max. But when we
remember that s(up-down) is itself basically a variance measure (see page
18 above), we see that the intended statistic is in reasonable form at
that. In fact, it closely resembles an F_max based on variances of the
trial distributions used for computing m and s.

Thus far we have a suggested testing routine for cases where step
sizes are between 1σ and 2σ, but since the changes in H for small step
sizes would only make the test more conservative, the test may prove
useful for all cases with step sizes smaller than 2σ.
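The suggested routine amounts to a single ratio compared against a tabled
value. The Python sketch below is illustrative only: the function names are
hypothetical, and the critical value is deliberately left as an argument to
be read from Hartley's table of F_max for the appropriate k and degrees of
freedom, since no table values are reproduced here.

```python
def updown_homogeneity_ratio(s_values):
    """Ratio of the largest to the smallest s from k up-and-down series.

    Note that the values of s themselves are used, not their squares.
    """
    values = list(s_values)
    if len(values) < 2 or min(values) <= 0:
        raise ValueError("need at least two positive values of s")
    return max(values) / min(values)


def rejects_homogeneity(s_values, f_max_critical):
    """Compare the ratio with a critical value of Hartley's F_max.

    `f_max_critical` must be supplied by the user from Hartley's table;
    it is an input to this sketch, not something computed here.
    """
    return updown_homogeneity_ratio(s_values) > f_max_critical


# Illustrative values of s from three concurrent series.
ratio = updown_homogeneity_ratio([1.2, 2.4, 1.8])
```

For the illustrative values shown, the ratio is 2.4/1.2, i.e. 2, and this
is the figure that would be compared with the tabled F_max.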

c. Evaluation of the proposed test. Of course what we really need is
information on how well the test works in practice. Good evidence for this
purpose should be available from the values of s obtained for concurrent
series. Such series, taken over the same time period on the same subject,
are homogeneous in their variance on a priori grounds. How does the test
fare with such data?


Consider the 40 subject-sessions, each with three concurrent up-and-down
series, which entered into our analysis in Table 9 above for experiments
CS-1 and CS-2. Of these 40 sessions, 16 were conducted with steps
between 1s and 2s in size, and 22 were conducted with steps smaller than
1s. The results of applying our test of homogeneity of values of s to
these 38 sets of concurrent series are summarized in Table 13. Critical
values of F_max were found by interpolation in available tables (Walker and
Lev, 1953) for an N of 14 for experiment CS-1 and an N of 19 for CS-2. The
proportions of "non-homogeneous" sets of s values appear to be reasonably
close to the intended .05 and .01 rejection levels. A further check of the
distribution of log (s_max/s_min) for these sets of data against the
distribution of the range for samples of three from the normal distribution
(McKay and Pearson, 1933) indicates that the rejection levels should hold
fairly well. We thus have some empirical support for the use of the
criterion of homogeneity.

Of course we shall never be concerned with testing the homogeneity of
concurrent series. Rather our interest will be in testing homogeneity from
day to day with the same subject or from replication to replication in
one-trial-per-subject experiments. Such a concern arose in experiment T-1
above (pages 35-38) and we there used the ratio s^2_max/s^2_min. We note
here that use of the present criterion leads to the acceptance of the
hypothesis of homogeneity for all three sets of data which were discussed
there.


Table 13


EMPIRICAL CHECK ON REJECTION RATE WHEN s_max/s_min
IS USED TO EVALUATE HOMOGENEITY OF VARIANCE (for k = 3).


                        No. of concurrent    Sessions where        Sessions where
                        series per session   (s_max/s_min)         (s_max/s_min)
Range of     No. of     (i.e. no. of         exceeded F_max,.95    exceeded F_max,.99
step size    sessions   s values)            Number  Proportion    Number  Proportion

1s - 2s        16           3                  2       .125          0       .00
below 1s       22           3                  1       .045          0       .00
                                              (.05 expected)        (.01 expected)

(Data from experiments CS-1 and CS-2)

20. SUMMARY AND EVALUATION.

This part of our report has been concerned with a description and
evaluation of the up-and-down method. We have seen it to be a sequential
method, applicable to the study of absolute or stimulus thresholds,
differential discrimination, context and scaling problems.

Early sections of our discussion dealt with the characteristics and
properties of the method under the model that the probability of response
curve or psychometric function is a cumulative normal distribution.
Estimates for $\mu$ and $\sigma$ were reviewed and the effect of using
different sizes of stimulus steps was discussed. It was indicated that the
preferred stimulus step for much psychological research will be between
$1\sigma$ and $2\sigma$ in size.
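
The stepping rule under the cumulative normal model can be sketched in Python as follows. This is a simulation under stated assumptions, not the procedure or the estimators reviewed in the report: the subject is simulated, the values of $\mu$, $\sigma$, step size and starting level are hypothetical, and the average of the levels visited is used as a simple stand-in for the estimate of $\mu$.

```python
import random
from statistics import NormalDist, mean


def run_up_and_down(mu, sigma, step, start, n_trials, rng):
    """Simulate one up-and-down series under a cumulative normal model.

    On each trial the simulated subject responds "greater" with
    probability Phi((level - mu) / sigma). After a "greater" response the
    next level is one step lower; after a "less" response it is one step
    higher.
    """
    psychometric = NormalDist(mu, sigma)
    level = start
    levels, responses = [], []
    for _ in range(n_trials):
        says_greater = rng.random() < psychometric.cdf(level)
        levels.append(level)
        responses.append(says_greater)
        level += -step if says_greater else step
    return levels, responses


# Hypothetical parameters: step size equal to sigma, start three steps high.
rng = random.Random(0)
levels, responses = run_up_and_down(
    mu=100.0, sigma=4.0, step=4.0, start=112.0, n_trials=2000, rng=rng
)

# Crude location estimate: average the levels visited after a short warm-up.
pse_estimate = mean(levels[20:])
```

The number of trials is set far higher than would be used with a real subject, only so that the stand-in estimate is stable from run to run.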

Subsequently, two testing arrangements were discussed: one in which
each trial on the up-and-down series is conducted using a different
subject (the one-trial-per-subject procedure), and the other in which all
observations are made on the same subject and the up-and-down sequence is
concealed through the use of several concurrent series (the concurrent
series procedure).
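
The concurrent series arrangement can be sketched in Python in the same spirit. The sketch assumes that the series to be tested on each trial is drawn at random from those with trials remaining; this ordering rule, the simulated subject and all parameter values are hypothetical and are not taken from the report.

```python
import random
from statistics import NormalDist


def run_concurrent_series(mu, sigma, step, starts, trials_per_series, rng):
    """Interleave several up-and-down series in random order for one subject.

    Each series keeps its own current level. The series tested on a given
    trial is drawn at random from those that still have trials remaining,
    so the up-and-down pattern of any one series is hidden in the overall
    trial sequence.
    """
    psychometric = NormalDist(mu, sigma)
    current = list(starts)
    remaining = [trials_per_series] * len(starts)
    history = [[] for _ in starts]   # levels visited, per series
    order = []                       # which series was tested on each trial
    while any(remaining):
        k = rng.choice([i for i, r in enumerate(remaining) if r > 0])
        level = current[k]
        says_greater = rng.random() < psychometric.cdf(level)
        history[k].append(level)
        order.append(k)
        current[k] = level - step if says_greater else level + step
        remaining[k] -= 1
    return history, order


# Hypothetical session: three concurrent series of 30 trials each.
rng = random.Random(1)
history, order = run_concurrent_series(
    mu=100.0, sigma=4.0, step=4.0,
    starts=[92.0, 100.0, 108.0], trials_per_series=30, rng=rng,
)
```

Each entry of `history` is an ordinary up-and-down series when read on its own, while `order` records how the three series were mixed in the sequence the subject actually met.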

For the one-trial-per-subject procedure, variability of the up-and-
down series is a function of within-subject variability and between-subject
variability, as summarized in Table 14. Successive trials are clearly
independent and the normality aspect of the model for the up-and-down method
would appear to be satisfied with a suitable choice of stimulus scale. As
indicated at the bottom of Table 14, the estimate of the standard error of
$m$ as provided by the up-and-down method gives indications of being
conservatively large.

Table 14

VARIANCE COMPONENTS AND VARIABILITY MEASURES
IN ONE-TRIAL-PER-SUBJECT TESTING.

Variability measures:

  (1) $s$: for a single one-trial-per-subject series.
  (2) $s_m$: as computed from $s$.
  (3) $s_{m\,\mathrm{obs(SG)}}$: observed variability of $m$ for
      replications with the Same Group of subjects.
  (4) $s_{m\,\mathrm{obs(IG)}}$: observed variability of $m$ for
      replications with Independent Groups of subjects.

(* indicates that source contributes to var. meas.)

Sources of variance:                             (1)   (2)   (3)   (4)

Variance of the psychometric function for         *     *     *     *
the individual subject, at any given time.

Variance of $\mu$ for the individual subject      *     *     *     *
from time to time (i.e. drift in the location
of the probability of response curve): the
basis of sampling variance assoc. with time
of testing each subject.

Variance in $\mu$ from subject to subject in a    *     *     *     *
given sample of subjects.

Variance in $\mu$ between independent groups                        *
of subjects, i.e. sampling variance with
regard to subjects.

For 3 experiments: $s_m > s_{m\,\mathrm{obs(SG)}}$
For 1 experiment:  $s_m > s_{m\,\mathrm{obs(IG)}}$

When the up-and-down method is used to study the discrimination of a
single subject, the model cannot be fully satisfied if interdependencies
exist between successive trials. Although the use of concurrent series
serves to eliminate interdependencies arising from insight into the
character of the up-and-down series, there is no way of avoiding
dependencies associated with drift of the PSE or with the effect of the
stimuli for each trial upon the judgment on the trial to follow. Of course
all psychophysical measurement must either live with these dependencies or
quantify them in some way. With the up-and-down method we have a procedure
which should let us examine and quantify these dependencies in an efficient
manner.

For the use of concurrent series with the same subject, the
variabilities with which we have to deal are those summarized in Table 15.
This table includes reference to observed variability between concurrent
series ($s_{m\,\mathrm{obs(con)}}$), which represents a "split-halves" kind
of reliability, and to observed session-to-session variability
($s_{m\,\mathrm{obs(sess)}}$), which is a test-retest reliability measure.
We find, as in the other areas of psychological research, that split-halves
reliability often exceeds test-retest reliability.
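
The two observed variability measures can be illustrated with a short Python sketch. It assumes one estimate of the PSE per concurrent series per session, takes the between-series measure as the pooled within-session standard deviation, and takes the session-to-session measure as the standard deviation of the session means. These computing formulas, the function names and the data are hypothetical; the report's own computations are not reproduced.

```python
from math import sqrt
from statistics import mean, variance


def between_series_sd(sessions):
    """Pooled within-session SD of the estimates from concurrent series."""
    return sqrt(mean([variance(estimates) for estimates in sessions]))


def between_session_sd(sessions):
    """SD of the session means across sessions."""
    return sqrt(variance([mean(estimates) for estimates in sessions]))


# Hypothetical PSE estimates: 3 sessions, 3 concurrent series per session.
sessions = [
    [99.0, 100.0, 101.0],
    [103.0, 104.0, 105.0],
    [95.0, 96.0, 97.0],
]

sd_con = between_series_sd(sessions)    # agreement among concurrent series
sd_sess = between_session_sd(sessions)  # agreement from session to session
```

In this made-up example the concurrent series agree closely within a session (`sd_con` is 1.0) while the sessions differ more widely (`sd_sess` is 4.0), which is the pattern the text describes as split-halves reliability exceeding test-retest reliability.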

Table 15

VARIANCE COMPONENTS AND VARIABILITY MEASURES
WHEN CONCURRENT SERIES ARE RUN WITH THE SAME SUBJECT.

Variability measures:

  (1) $s$: for a single up-and-down series.
  (2) $s_m$: as computed from $s$.
  (3) $s_{m\,\mathrm{obs(con)}}$: observed variability of $m$ for
      concurrent series.
  (4) $s_{m\,\mathrm{obs(sess)}}$: observed variability of $m$ for
      sessions.

(* indicates that source contributes to var. meas.)

A. Sources of variance:                          (1)   (2)   (3)   (4)

Variance of the probability of response curve     *     *     *     *
or psychometric function of the subject at any
given time.

Variance of $\mu$ for that subject from time to   *     *     *     *
time in the same test session (i.e. trial-to-
trial variability or drift in the location of
the probability of response curve).

Covariance between concurrent series.                         *

Variance in $\mu$ for the subject from session                      *
to session (i.e. long-term changes in the
location of the probability of response curve).

B. Direction of Experimental Bias

Positive covariance between series, when step size is small.
This makes . . . $s_m > s_{m\,\mathrm{obs(con)}}$

Large session changes for some subjects.
This makes . . . $s_m < s_{m\,\mathrm{obs(sess)}}$

Evidence has been presented to show that for the determination of a
PSE, the up-and-down method has advantages over both the method of constant
stimuli and the method of limits. It provides measures which may be
described as more valid than measures by the method of constant stimuli,
where the obtained PSE is constrained to be near the POE. The up-and-down
method also provides PSEs which are potentially less biased than PSEs
obtained by the method of limits. Because the up-and-down method hunts for
the PSE and concentrates trials near the PSE, satisfactory measurements can
frequently be made with this method on the basis of a very small number of
trials.
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